[jira] [Commented] (JENA-820) Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines

Rob Vesse (JIRA) Thu, 27 Nov 2014 09:35:33 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227850#comment-14227850
 ]


Rob Vesse commented on JENA-820:
--------------------------------

Yes, I already use {{ParserProfile}} for line-based inputs. The current 
behaviour is to assign a seed based on the combination of Job ID and file path, 
which yields file-scoped IDs that are consistent within a job (though not 
necessarily reproducible by subsequent jobs).  What would be good to know is 
how to pass a {{ParserProfile}} down when using the {{RDFDataMgr.parse()}} type 
operations, since I can't see an obvious way to do this right now.
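
To make the seeding idea concrete, here is a minimal sketch of allocating 
identifiers from a seed built out of the Job ID and file path. It is 
illustrative only and is not the actual RDF Tools for Hadoop code: the class 
{{SeededBlankNodeIds}}, its method names, and the use of a name-based UUID as 
the hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Illustrative only: hypothetical names, not the real RDF Tools for Hadoop implementation. */
final class SeededBlankNodeIds {
    private SeededBlankNodeIds() {}

    /** Build a seed from Job ID and file path; every task reading the same file in the same job gets the same seed. */
    static String seed(String jobId, String filePath) {
        // The NUL separator keeps ("a", "bc") and ("ab", "c") from producing the same seed.
        return jobId + '\0' + filePath;
    }

    /** Deterministically map (seed, syntactic label) to an internal identifier. */
    static String allocate(String seed, String label) {
        byte[] key = (seed + '\0' + label).getBytes(StandardCharsets.UTF_8);
        // A name-based UUID is a pure function of the bytes, so no shared state is needed between tasks.
        return UUID.nameUUIDFromBytes(key).toString();
    }
}
```

Under this sketch, two map tasks reading different chunks of the same file in 
the same job compute the same seed and therefore agree on the identifier for 
a given label, while the same label in a different file or a different job 
maps to a different identifier.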

I am loath to use {{<_:label>}} unless it is the only viable solution since it 
goes outside standard RDF and makes the output non-portable.

Using the Thrift output format is a good workaround especially for multi-stage 
pipelines since it is very efficient to read and write.

There is also the issue that you don't want this behaviour on by default. If I 
start with two files (from some external source) that have equivalent blank 
node labels, those labels should be treated as file-scoped identifiers and 
assigned different identifiers.  You only want this behaviour on when you know 
the input was produced by a previous job and may be spread over multiple files, 
so that blank nodes with the same label may appear in more than one file.
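
The difference between the default and the opt-in behaviour can be sketched 
as follows. This is a hedged illustration, not the real implementation: the 
class {{BlankNodeScopeDemo}}, the {{ignoreFilePath}} flag, and the hashing 
scheme are hypothetical names and choices invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Illustrative only: hypothetical names, not the real RDF Tools for Hadoop implementation. */
final class BlankNodeScopeDemo {
    private BlankNodeScopeDemo() {}

    /** Allocate an identifier scoped by job and file, or by job alone when ignoreFilePath is set. */
    static String allocate(String jobId, String filePath, String label, boolean ignoreFilePath) {
        String scope = ignoreFilePath ? jobId : jobId + '\0' + filePath;
        byte[] key = (scope + '\0' + label).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(key).toString();
    }

    /** Collect the distinct identifiers a job would allocate for the labels found in each input file. */
    static Set<String> distinctIds(String jobId, Map<String, List<String>> labelsByFile,
                                   boolean ignoreFilePath) {
        Set<String> ids = new HashSet<>();
        for (Map.Entry<String, List<String>> file : labelsByFile.entrySet()) {
            for (String label : file.getValue()) {
                ids.add(allocate(jobId, file.getKey(), label, ignoreFilePath));
            }
        }
        return ids;
    }
}
```

If a second-stage job reads two reducer outputs that both mention the label 
{{b0}}, calling {{distinctIds}} with {{ignoreFilePath}} set to false yields 
two identifiers (the node has diverged), whereas setting it to true yields 
one. The same call on two unrelated external files shows why the flag should 
stay off by default: with it on, their coincidentally equal labels would be 
merged into a single node.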

> Blank Node output under Hadoop can cause identifiers to diverge in 
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
>                 Key: JENA-820
>                 URL: https://issues.apache.org/jira/browse/JENA-820
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RDF Tools for Hadoop
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>             Fix For: Jena 2.12.2
>
>
> In writing up the documentation on the RDF Tools for Hadoop and enumerating 
> the possible issues that blank nodes imply I discovered an issue that I 
> hadn't previously considered.
> For a single job the input and output formats all ensure that blank nodes are 
> consistently given the same identifiers if they had the same syntactic ID and 
> were in the same file.  This is done even when a file is being read in 
> multiple chunks by multiple map tasks.  However by its nature each reduce 
> task will create an output file so potentially you can end up with blank 
> nodes spread over multiple files.
> However, if we then read these files into a subsequent job, the blank nodes 
> may now be spread across multiple files. Even though they were originally 
> the same node, our allocation policy will cause the identifiers to diverge 
> and become distinct blank nodes, which is incorrect behaviour.
> Since there is no clear universal fix for this, what I am considering is 
> introducing a configuration setting that will allow the file path to be 
> ignored for the purpose of blank node identifier allocation within a job. 
> This will mean that identifiers are allocated purely on the basis of the Job 
> ID, and thus the same syntactic ID in any file will result in the same blank 
> node identifier.  As the user will hopefully have left this turned off for 
> the first job, even if we start with the same syntactic ID in different 
> files, the normal allocation policy for the first job should ensure unique 
> IDs for the later jobs.
> My next step on this is to implement a failing unit test (and then 
> temporarily ignore it) which demonstrates this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
