[
https://issues.apache.org/jira/browse/JENA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974185#comment-13974185
]
Rob Vesse commented on JENA-675:
--------------------------------
The specific problem with blank node labels in Hadoop RDF Tools relates to how
the {{OutputFormat}} implementations there use the {{RiotWriter}} APIs.
Some of the implementations work in batch style: they store up some number
of quads in a temporary {{Graph}} instance and then write them out once the
batch threshold is reached. This avoids exhausting the memory of the Hadoop jobs
while still getting some useful data compression. However, the problem is that
there is no control over label policy, so if two batches contain different blank
nodes the writer can quite legitimately prettify them to the same label in the
output data and unintentionally collapse them into a single node.
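
To make the failure mode concrete, here is a minimal, self-contained sketch of the collision. It does not use the Jena APIs: {{BlankNode}}, {{LabelAllocator}} and {{writeBatch}} are hypothetical stand-ins, and the assumption is simply that each batch write starts with a fresh pretty-printing allocator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class BatchLabelCollision {
    /** Hypothetical stand-in for a blank node; object identity distinguishes nodes. */
    static final class BlankNode {}

    /** Hypothetical pretty-printing allocator: hands out b0, b1, ... in first-seen order. */
    static final class LabelAllocator {
        private final Map<BlankNode, String> labels = new IdentityHashMap<>();

        String labelFor(BlankNode node) {
            String label = labels.get(node);
            if (label == null) {
                label = "_:b" + labels.size();
                labels.put(node, label);
            }
            return label;
        }
    }

    /**
     * Writes one batch. A fresh allocator per call mimics a writer that
     * offers no way to share a label policy across writer runs.
     */
    static List<String> writeBatch(List<BlankNode> subjects) {
        LabelAllocator perBatch = new LabelAllocator();
        List<String> lines = new ArrayList<>();
        for (BlankNode s : subjects) {
            lines.add(perBatch.labelFor(s) + " <http://example.org/p> \"v\" .");
        }
        return lines;
    }

    public static void main(String[] args) {
        BlankNode a = new BlankNode();
        BlankNode b = new BlankNode(); // a different node from a

        List<String> out = new ArrayList<>();
        out.addAll(writeBatch(Arrays.asList(a))); // first batch
        out.addAll(writeBatch(Arrays.asList(b))); // second batch
        out.forEach(System.out::println);
    }
}
```

Both output lines use the label {{_:b0}}, so anything reading the concatenated output sees one blank node where there were two.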
The key syntax is actually Turtle, since it is the only syntax where we really
want this batching behaviour in order to get reasonable compression. N-Triples
and N-Quads can be, and are, output using the {{NodeFormatter}} API directly, which
does allow for specifying a {{NodeToLabel}} policy, so those syntaxes don't have
the problem. RDF/XML and RDF/JSON require complete graphs to generate
valid data, so they try to cache everything (at the risk of memory exhaustion)
and do a single write once all output has been produced, thus avoiding any
problem with blank node labelling for those syntaxes.
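
As a rough illustration of why a caller-supplied label policy avoids the problem for line-oriented syntaxes, the sketch below passes one policy instance into several separate write calls. It is not the Jena API: {{LabelPolicy}}, {{StableLabels}} and {{writeLines}} are hypothetical names that only loosely play the role described above for {{NodeToLabel}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SharedLabelPolicy {
    /** Hypothetical stand-in for a blank node; object identity distinguishes nodes. */
    static final class BlankNode {}

    /** Hypothetical policy interface: maps a blank node to its output label. */
    interface LabelPolicy {
        String labelFor(BlankNode node);
    }

    /** Remembers every node it has labelled, so labels stay stable across writes. */
    static final class StableLabels implements LabelPolicy {
        private final Map<BlankNode, String> labels = new IdentityHashMap<>();

        @Override
        public String labelFor(BlankNode node) {
            String label = labels.get(node);
            if (label == null) {
                label = "_:b" + labels.size();
                labels.put(node, label);
            }
            return label;
        }
    }

    /** Line-oriented write: the caller decides which policy instance is used. */
    static List<String> writeLines(List<BlankNode> subjects, LabelPolicy policy) {
        List<String> lines = new ArrayList<>();
        for (BlankNode s : subjects) {
            lines.add(policy.labelFor(s) + " <http://example.org/p> \"v\" .");
        }
        return lines;
    }

    public static void main(String[] args) {
        BlankNode a = new BlankNode();
        BlankNode b = new BlankNode();
        LabelPolicy shared = new StableLabels(); // one instance for all writes

        List<String> out = new ArrayList<>();
        out.addAll(writeLines(Arrays.asList(a), shared));    // first write
        out.addAll(writeLines(Arrays.asList(b, a), shared)); // second write
        out.forEach(System.out::println);
    }
}
```

Because the policy outlives each individual write, {{a}} keeps the same label in both writes and {{b}} gets a different one, which is the property the batch-style Turtle output currently cannot guarantee.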
> Add and use a WriterProfile API
> -------------------------------
>
> Key: JENA-675
> URL: https://issues.apache.org/jira/browse/JENA-675
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ, RIOT
> Affects Versions: Jena 2.11.1
> Reporter: Rob Vesse
>
> Currently we have a {{ParserProfile}} which allows specifying certain aspects
> of input behaviour such as Prologue and Label to Node ID.
> However, we don't have a corresponding {{WriterProfile}} API. We do have
> a class called {{OutputProfile}}, but it is never actually used anywhere.
> This would be particularly useful for languages that rely on the
> {{NodeFormatter}} API where we can find comments such as the following:
> {quote}
> // Replace with a single "OutputPolicy"
> {quote}
> The lack of this API means we give users no way to do things
> like control how blank node IDs are allocated. Existing functionality we
> do give them, like providing a set of namespaces and a base URI to use for
> serialisation, needs to be folded into this API.
> I know of two places where this is currently causing issues:
> * In the incoming Hadoop RDF Tools code (JENA-666) many output formats
> currently mangle the data when outputting blank nodes because they can't
> share a {{NodeToLabel}} instance over multiple writer runs.
> * In an internal bug at Cray we're seeing a situation where different code
> paths lead to different presentation of blank nodes and we have no APIs to
> allow us to control this presentation.