[
https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rob Vesse closed JENA-820.
--------------------------
> Blank Node output under Hadoop can cause identifiers to diverge in
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
> Key: JENA-820
> URL: https://issues.apache.org/jira/browse/JENA-820
> Project: Apache Jena
> Issue Type: Improvement
> Components: RDF Tools for Hadoop
> Reporter: Rob Vesse
> Assignee: Rob Vesse
> Fix For: Jena 2.12.2
>
>
> While writing up the documentation on the RDF Tools for Hadoop and enumerating
> the possible issues that blank nodes raise, I discovered an issue that I
> hadn't previously considered.
> For a single job, the input and output formats all ensure that blank nodes are
> consistently given the same identifier if they had the same syntactic ID and
> were in the same file. This holds even when a file is being read in
> multiple chunks by multiple map tasks. However, by its nature each reduce
> task creates its own output file, so you can potentially end up with blank
> nodes spread over multiple files.
> If we then read these files into a subsequent job, a blank node that was
> originally a single node may now appear in multiple files. Because our
> allocation policy takes the file path into account, its identifiers diverge
> and it becomes several distinct blank nodes, which is incorrect behaviour.
> Since there is no clear universal fix for this, what I am considering doing
> instead is introducing a configuration setting that allows the file path to
> be ignored for the purpose of blank node identifier allocation within a job.
> Identifiers will then be allocated purely on the basis of the Job ID, and
> thus the same syntactic ID in any file will result in the same blank node
> identifier. Provided the user has left this setting turned off for the first
> job, the normal allocation policy there still assigns unique IDs to nodes
> that share a syntactic ID but come from different files, so those nodes stay
> distinct in the later jobs.
> My next step on this is to implement a failing unit test (and then
> temporarily ignore it) which demonstrates this issue.
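
The allocation behaviour described in the issue can be illustrated with a
small sketch. This is not the Jena implementation: the class name, the
allocate() helper, the ignoreFilePath flag, and the job and file names are all
hypothetical, and hashing a seed string with a name-based UUID is only an
assumed stand-in for whatever scheme the real input formats use.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class BlankNodeIdSketch {

    // Hypothetical helper (not a Jena API): derives a deterministic blank node
    // identifier from the job ID, optionally the file path, and the syntactic ID.
    static String allocate(String jobId, String filePath, String syntacticId,
                           boolean ignoreFilePath) {
        String seed = ignoreFilePath
                ? jobId + "|" + syntacticId
                : jobId + "|" + filePath + "|" + syntacticId;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // First job, setting off: the same syntactic ID in two different
        // input files yields two different identifiers, as intended.
        String x = allocate("job1", "a.nt", "b0", false);
        String y = allocate("job1", "b.nt", "b0", false);
        System.out.println("first job keeps files apart: " + !x.equals(y));

        // Assume the reducers of the first job wrote node x into two part files.
        // Second job, setting off: the one node diverges into two identifiers.
        String p0 = allocate("job2", "part-r-00000", x, false);
        String p1 = allocate("job2", "part-r-00001", x, false);
        System.out.println("diverges with file path: " + !p0.equals(p1));

        // Second job, setting on: the file path is ignored, so the node
        // receives a single identifier regardless of which file it is read from.
        String q0 = allocate("job2", "part-r-00000", x, true);
        String q1 = allocate("job2", "part-r-00001", x, true);
        System.out.println("stable without file path: " + q0.equals(q1));

        // Nodes x and y were already distinct after the first job, so they
        // remain distinct in the second job even with the file path ignored.
        String r = allocate("job2", "part-r-00000", y, true);
        System.out.println("distinct nodes stay distinct: " + !q0.equals(r));
    }
}
```

Under these assumptions all four lines print true. The sketch also shows why
the setting is only safe for later stages: turning it on for the first job
would merge nodes that merely share a syntactic ID across unrelated input
files.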