[jira] [Commented] (JENA-820) Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines

Andy Seaborne (JIRA) Sat, 29 Nov 2014 02:33:42 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228709#comment-14228709
 ]


Andy Seaborne commented on JENA-820:
------------------------------------

I'm glad you have found {{ParserProfile}} and aren't wasting time building a 
mechanism that it already provides.

{{RDFDataMgr.createReader}} followed by {{.setParserProfile}} should work.  If 
it doesn't, please raise a JIRA.  There is a risk that this is a path less 
travelled: it is supposed to work with all readers, but maybe it doesn't.  
{{WriterGraphRIOT}}/{{WriterDatasetRIOT}} should have the complementary way to 
set the label to node mapper.  That is currently missing: JENA-821.

Let's define the required system contracts in a set of test cases.

What I'd like to do is to more formally define the blank node label in Jena as a 
UUID.  That is, a global unique identifier that you can rely on as being safe 
and isn't a URI for this intra-system use case. The fact that identifiers that 
match the UUID syntax can be stored compactly as 2 longs is a bonus.
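
To illustrate the "2 longs" point, here is a minimal sketch using only {{java.util.UUID}}. It is not Jena code; the class and method names ({{UuidLabelSketch}}, {{toLongs}}, {{toLabel}}) are hypothetical and exist only for this example.

```java
import java.util.UUID;

// Hypothetical helper, not part of Jena: shows that a label in UUID syntax
// can be packed into two longs and recovered without loss.
final class UuidLabelSketch {

    /** Pack a UUID-syntax label into two longs, most significant bits first. */
    static long[] toLongs(String label) {
        UUID uuid = UUID.fromString(label);
        return new long[] { uuid.getMostSignificantBits(), uuid.getLeastSignificantBits() };
    }

    /** Rebuild the canonical (lower-case) label from the two longs. */
    static String toLabel(long[] bits) {
        return new UUID(bits[0], bits[1]).toString();
    }

    public static void main(String[] args) {
        String label = UUID.randomUUID().toString();
        long[] packed = toLongs(label);
        System.out.println(label + " -> " + packed[0] + ", " + packed[1]);
        System.out.println("round trip ok: " + label.equals(toLabel(packed)));
    }
}
```

A label that does not match the UUID syntax would make {{UUID.fromString}} throw, so a real store would presumably need a fallback representation for such labels.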


At some level, moving a blank node across machines and wanting it to match up 
with the same blank node that has travelled a different path "inside the graph" 
is outside RDF standards, at least the syntaxes.  The standard syntaxes only 
talk about what happens at the boundary of the graph, not issues within the 
graph. Your example of where you don't want it is, to my way of thinking about 
it, the boundary. Kong needs both.

That is very painful when "the graph" is across different machines or any place 
across boundaries where references to concrete objects don't work. 

The RDF-WG work on skolemization gives reference across the graph boundary.  
That's the nearest to portability. The use case is a reference across the web 
that can come back to the original place (the host name fixes that and defines 
the scope of where it can be matched). 

{{<_:label>}} is a different use case: identity within a distributed system 
(it also predates RDF-WG by several years; it's becoming more important as we 
go multi-machine).  The point is to have a blank node syntax that is parseable 
but is an illegal URI.  {{_}} is not a valid scheme name, because a scheme name 
must start with a letter ({{[a-zA-Z]}}).
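
As a rough check of the scheme-name argument, the sketch below tests a reference against the scheme grammar from RFC 3986 ({{ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )}}) using a plain regular expression. It is an illustration only, not how Jena's parsers decide this; {{SchemeCheckSketch}} and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena: checks whether the text inside <...>
// starts with a syntactically valid URI scheme per RFC 3986.
final class SchemeCheckSketch {

    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
    private static final Pattern SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    /** True if the reference begins with a valid scheme name and a colon. */
    static boolean hasValidScheme(String ref) {
        return SCHEME.matcher(ref).find();
    }

    /** True if the reference uses the "_:label" form with a non-empty label. */
    static boolean isBlankNodeForm(String ref) {
        return ref.startsWith("_:") && ref.length() > 2;
    }

    public static void main(String[] args) {
        for (String ref : new String[] { "http://example/a", "urn:uuid:1234", "_:b0" }) {
            System.out.println(ref + " validScheme=" + hasValidScheme(ref)
                    + " blankNodeForm=" + isBlankNodeForm(ref));
        }
    }
}
```

Because {{_:b0}} can never pass the scheme check, the two forms cannot be confused: anything in {{<...>}} that starts with {{_:}} is unambiguously a blank node label rather than an absolute URI.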



> Blank Node output under Hadoop can cause identifiers to diverge in 
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
>                 Key: JENA-820
>                 URL: https://issues.apache.org/jira/browse/JENA-820
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RDF Tools for Hadoop
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>             Fix For: Jena 2.12.2
>
>
> In writing up the documentation on the RDF Tools for Hadoop and enumerating 
> the possible issues that blank nodes imply I discovered an issue that I 
> hadn't previously considered.
> For a single job the input and output formats all ensure that blank nodes are 
> consistently given the same identifiers if they had the same syntactic ID and 
> were in the same file.  This is done even when a file is being read in 
> multiple chunks by multiple map tasks.  However by its nature each reduce 
> task will create an output file so potentially you can end up with blank 
> nodes spread over multiple files.
> However if we then read these files into a subsequent job the blank nodes may 
> now be spread across multiple files so even though they were the same node 
> originally our allocation policy will cause the identifiers to diverge and 
> become distinct blank nodes which is incorrect behaviour.
> Since there is no clear universal fix for this what I am considering doing is 
> instead introducing a configuration setting that will allow the file path to 
> be ignored for the purpose of blank node identifier allocations within a job. 
>  This will mean that identifiers are purely allocated on the basis of the Job 
> ID and thus the same syntactic ID in any file will result in the same blank 
> node identifier.  As the user will hopefully have left this turned off 
> for the first job even if we start with the same syntactic ID but in 
> different files the normal allocation policy for the first job should ensure 
> unique IDs for the later jobs.
> My next step on this is to implement a failing unit test (and then 
> temporarily ignore it) which demonstrates this issue.
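
The allocation behaviour described in the issue can be sketched as follows. This is only a model of the idea, assuming the identifier is derived by hashing a seed built from the job ID, the file path and the syntactic label; the real RDF Tools for Hadoop code may derive identifiers differently. {{BlankNodeAllocSketch}}, {{allocate}} and the {{ignoreFilePath}} flag are hypothetical names standing in for the proposed configuration setting.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical model, not the actual Hadoop integration code: derives a
// deterministic blank node identifier from a seed string.
final class BlankNodeAllocSketch {

    /**
     * Default policy seeds on (job ID, file path, label). With ignoreFilePath
     * set, the seed is (job ID, label) only, so the same label read from
     * different files within one job maps to the same identifier.
     */
    static String allocate(String jobId, String filePath, String label, boolean ignoreFilePath) {
        String seed = ignoreFilePath
                ? jobId + "\n" + label
                : jobId + "\n" + filePath + "\n" + label;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Job 1, setting off: the same label in two input files stays distinct.
        String a1 = allocate("job1", "in/a.nt", "b0", false);
        String a2 = allocate("job1", "in/b.nt", "b0", false);
        System.out.println("job1 distinct: " + !a1.equals(a2));

        // Job 2 reads job 1's output, where node a1 now appears in two part files.
        String off1 = allocate("job2", "out/part-00000", a1, false);
        String off2 = allocate("job2", "out/part-00001", a1, false);
        System.out.println("job2 default policy diverges: " + !off1.equals(off2));

        String on1 = allocate("job2", "out/part-00000", a1, true);
        String on2 = allocate("job2", "out/part-00001", a1, true);
        System.out.println("job2 ignoring file path agrees: " + on1.equals(on2));
    }
}
```

In this model the setting is only safe for later jobs because the first job, run with the default policy, has already made the labels unique across files.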



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
