[ https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227850#comment-14227850 ]
Rob Vesse commented on JENA-820:
--------------------------------
Yes, I already use {{ParserProfile}} for line-based inputs. The current
behaviour is to assign a seed based on the combination of Job ID and file path,
which yields file-scoped IDs that are consistent within a job (though not
necessarily reproducible by subsequent jobs). What would be good to know is
how to pass a {{ParserProfile}} down when using the {{RDFDataMgr.parse()}}-type
operations, since I can't see an obvious way to do this right now.
I am loath to use {{<_:label>}} unless it is the only viable solution since it
goes outside standard RDF and makes the output non-portable.
Using the Thrift output format is a good workaround especially for multi-stage
pipelines since it is very efficient to read and write.
There is also the issue that you don't want this behaviour on by default. If I
start with two files (from some external source) that have equivalent blank
node labels, they should be treated as file-scoped identifiers and assigned
different identifiers. You only need/want this behaviour on when you know the
data is the output of a previous job and that blank nodes with the same labels
may be spread over multiple files.
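
To make the two scoping modes concrete, here is a minimal sketch in plain Java.
It is not the actual RDF Tools for Hadoop code: the class name, the
{{ignoreFilePath}} flag and the use of name-based UUIDs are all assumptions made
for illustration. It only shows how a seed built from Job ID plus file path
gives file-scoped IDs, while a seed built from Job ID alone gives job-scoped IDs.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/**
 * Hypothetical sketch of seed-based blank node ID allocation.
 * All names here are illustrative and are not part of Jena's API.
 */
final class BlankNodeIdSketch {
    private final String jobId;
    private final boolean ignoreFilePath; // assumed configuration setting

    BlankNodeIdSketch(String jobId, boolean ignoreFilePath) {
        this.jobId = jobId;
        this.ignoreFilePath = ignoreFilePath;
    }

    /** Derives a deterministic ID from the seed and the syntactic label. */
    String allocate(String filePath, String label) {
        // Default: seed = Job ID + file path, so labels are file scoped.
        // With ignoreFilePath: seed = Job ID only, so labels are job scoped.
        String seed = ignoreFilePath ? jobId : jobId + "\n" + filePath;
        String key = seed + "\n" + label;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        BlankNodeIdSketch fileScoped = new BlankNodeIdSketch("job_0001", false);
        BlankNodeIdSketch jobScoped = new BlankNodeIdSketch("job_0002", true);

        // Same label in two different files, file-scoped mode: distinct IDs.
        System.out.println(fileScoped.allocate("part-00000", "b0")
                .equals(fileScoped.allocate("part-00001", "b0")));

        // Same label in two different files, job-scoped mode: same ID.
        System.out.println(jobScoped.allocate("part-00000", "b0")
                .equals(jobScoped.allocate("part-00001", "b0")));
    }
}
```

In this sketch the first job would run with the flag off, so equal labels from
unrelated source files stay distinct, and a later job reading that job's output
would run with the flag on, so a node split across reducer output files keeps a
single identity.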
> Blank Node output under Hadoop can cause identifiers to diverge in
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
> Key: JENA-820
> URL: https://issues.apache.org/jira/browse/JENA-820
> Project: Apache Jena
> Issue Type: Improvement
> Components: RDF Tools for Hadoop
> Reporter: Rob Vesse
> Assignee: Rob Vesse
> Fix For: Jena 2.12.2
>
>
> In writing up the documentation on the RDF Tools for Hadoop and enumerating
> the possible issues that blank nodes imply I discovered an issue that I
> hadn't previously considered.
> For a single job the input and output formats all ensure that blank nodes are
> consistently given the same identifiers if they had the same syntactic ID and
> were in the same file. This is done even when a file is being read in
> multiple chunks by multiple map tasks. However by its nature each reduce
> task will create an output file so potentially you can end up with blank
> nodes spread over multiple files.
> However if we then read these files into a subsequent job the blank nodes may
> now be spread across multiple files so even though they were the same node
> originally our allocation policy will cause the identifiers to diverge and
> become distinct blank nodes which is incorrect behaviour.
> Since there is no clear universal fix for this what I am considering doing is
> instead introducing a configuration setting that will allow the file path to
> be ignored for the purpose of blank node identifier allocations within a job.
> This will mean that identifiers are purely allocated on the basis of the Job
> ID and thus the same syntactic ID in any file will result in the same blank
> node identifier. As the user will hopefully have left this turned off
> for the first job, even if we start with the same syntactic ID in
> different files, the normal allocation policy for the first job should ensure
> unique IDs for the later jobs.
> My next step on this is to implement a failing unit test (and then
> temporarily ignore it) which demonstrates this issue.