[ 
https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Vesse closed JENA-820.
--------------------------

> Blank Node output under Hadoop can cause identifiers to diverge in 
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
>                 Key: JENA-820
>                 URL: https://issues.apache.org/jira/browse/JENA-820
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RDF Tools for Hadoop
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>             Fix For: Jena 2.12.2
>
>
> While writing up the documentation for the RDF Tools for Hadoop and 
> enumerating the issues that blank nodes can cause, I discovered a problem 
> that I hadn't previously considered.
> For a single job, the input and output formats ensure that blank nodes are 
> consistently given the same identifiers if they have the same syntactic ID 
> and are in the same file.  This holds even when a file is read in multiple 
> chunks by multiple map tasks.  However, by its nature each reduce task 
> creates its own output file, so occurrences of the same blank node can end 
> up spread over multiple files.
> If we then read these files into a subsequent job, the occurrences of a 
> blank node sit in different files, so even though they were originally the 
> same node, our allocation policy causes their identifiers to diverge and 
> they become distinct blank nodes, which is incorrect behaviour.
> Since there is no clear universal fix for this, I am considering 
> introducing a configuration setting that allows the file path to be ignored 
> for the purpose of blank node identifier allocation within a job.  
> Identifiers would then be allocated purely on the basis of the Job ID, so 
> the same syntactic ID in any file results in the same blank node 
> identifier.  Provided the user has left this setting turned off for the 
> first job, the normal allocation policy for that job should already have 
> ensured unique IDs, even where the same syntactic ID appeared in different 
> input files, so later jobs can safely ignore the file path.
> My next step on this is to implement a failing unit test (and then 
> temporarily ignore it) which demonstrates this issue.
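
The divergence described above, and the effect of the proposed setting, can be illustrated with a minimal sketch. This is not the allocator used by the RDF Tools for Hadoop: the class name, the method name, the `ignorePath` flag and the use of a name-based UUID are all assumptions made purely for illustration. The only idea taken from the issue is that an identifier is derived from the Job ID, the file path and the syntactic ID, and that the setting drops the file path from that derivation.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration only; names and hashing scheme are assumptions,
// not the actual Jena implementation.
final class BlankNodeIdSketch {

    // Derives a deterministic identifier from the job ID, optionally the file
    // path, and the syntactic blank node label.
    static String allocate(String jobId, String filePath, String label, boolean ignorePath) {
        String seed = ignorePath
                ? jobId + '\0' + label
                : jobId + '\0' + filePath + '\0' + label;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Second job reads two reducer outputs that both mention the same node "b0".
        String jobId = "job_2";

        // Normal policy: the path is part of the seed, so the identifiers diverge.
        String a = allocate(jobId, "part-r-00000", "b0", false);
        String b = allocate(jobId, "part-r-00001", "b0", false);
        System.out.println("same ID with path included: " + a.equals(b));

        // Proposed setting: the path is ignored, so the identifiers agree.
        String c = allocate(jobId, "part-r-00000", "b0", true);
        String d = allocate(jobId, "part-r-00001", "b0", true);
        System.out.println("same ID with path ignored:  " + c.equals(d));
    }
}
```

In this sketch the first comparison is false and the second is true, which mirrors the behaviour the issue describes for a second-stage job with the setting off and on respectively.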



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
