[jira] [Commented] (JENA-675) Add and use a WriterProfile API

Rob Vesse (JIRA) Fri, 18 Apr 2014 09:06:06 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974185#comment-13974185
 ]


Rob Vesse commented on JENA-675:
--------------------------------

The specific problem with blank node labels in Hadoop RDF Tools relates to how 
the {{OutputFormat}} implementations there use the {{RiotWriter}} APIs.

Some of the implementations work in a batch style: they store up some number 
of quads in a temporary {{Graph}} instance and then write them out once the 
batch threshold is reached.  This avoids exhausting the memory of the Hadoop 
jobs while still getting some useful data compression.  The problem is that 
there is no control over the label policy, so if two batches contain different 
blank nodes the writer can quite legitimately prettify them to the same label 
in the output data and unintentionally collapse them into a single node.
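
To illustrate the collapse described above, here is a minimal, self-contained 
sketch. It does not use the Jena APIs: {{LabelAllocator}} and 
{{BatchLabelDemo}} are hypothetical names standing in for a label policy and a 
batching writer, and blank nodes are modelled as plain Java objects. It only 
shows why a fresh label policy per batch can give two distinct blank nodes the 
same label, while a policy shared across batches does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a blank node label policy (not Jena's NodeToLabel).
class LabelAllocator {
    private final Map<Object, String> labels = new HashMap<>();
    private int counter = 0;

    // Returns the label already assigned to this node, or allocates the next one.
    String labelFor(Object blankNode) {
        return labels.computeIfAbsent(blankNode, n -> "_:b" + counter++);
    }
}

class BatchLabelDemo {
    // Each batch gets its own allocator, which is effectively what happens
    // when every batch is handed to a writer with no shared label policy.
    static List<String> writeWithFreshAllocators(List<List<Object>> batches) {
        List<String> out = new ArrayList<>();
        for (List<Object> batch : batches) {
            LabelAllocator perBatch = new LabelAllocator();
            for (Object node : batch) {
                out.add(perBatch.labelFor(node));
            }
        }
        return out;
    }

    // One allocator is shared by all batches, so labels stay distinct.
    static List<String> writeWithSharedAllocator(List<List<Object>> batches) {
        LabelAllocator shared = new LabelAllocator();
        List<String> out = new ArrayList<>();
        for (List<Object> batch : batches) {
            for (Object node : batch) {
                out.add(shared.labelFor(node));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Object a = new Object(); // blank node in the first batch
        Object b = new Object(); // a different blank node in the second batch
        List<List<Object>> batches = Arrays.asList(Arrays.asList(a), Arrays.asList(b));
        System.out.println(writeWithFreshAllocators(batches)); // [_:b0, _:b0]
        System.out.println(writeWithSharedAllocator(batches)); // [_:b0, _:b1]
    }
}
```

In the first case the two distinct nodes both come out as {{_:b0}}, so anything 
reading the concatenated output would treat them as one node. Being able to 
supply the label policy from outside the writer is what would avoid this.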

The key syntax is actually Turtle, since it is the only syntax where we really 
want this batching behaviour in order to get reasonable compression.  N-Triples 
and N-Quads can be, and are, output using the {{NodeFormatter}} API directly, 
which does allow a {{NodeToLabel}} policy to be specified, so those syntaxes 
don't have the problem.  RDF/XML and RDF/JSON require complete graphs to 
generate valid data, so they try to cache everything (at the risk of memory 
exhaustion) and do a single write once all output has been produced, thus 
avoiding any problem with blank node labelling for those syntaxes.

> Add and use a WriterProfile API
> -------------------------------
>
>                 Key: JENA-675
>                 URL: https://issues.apache.org/jira/browse/JENA-675
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ, RIOT
>    Affects Versions: Jena 2.11.1
>            Reporter: Rob Vesse
>
> Currently we have a {{ParserProfile}} which allows specifying certain aspects 
> of input behaviour such as Prologue and Label to Node ID.
> However we don't have a corresponding {{WriterProfile}} API.  We actually have 
> a class called {{OutputProfile}} but this is never used anywhere.
> This would be particularly useful for languages that rely on the 
> {{NodeFormatter}} API where we can find comments such as the following:
> {quote}
> // Replace with a single "OutputPolicy"
> {quote}
> The lack of this API means we don't provide users any ability to do things 
> like control how blank node IDs are allocated.  And existing functionality we 
> do give them like providing a set of namespaces and base URI to use for 
> serialisation needs to be folded into this API.
> I know of two places where this is currently causing issues:
> * In the incoming Hadoop RDF Tools code (JENA-666) many output formats 
> currently mangle the data when outputting blank nodes because they can't 
> share a {{NodeToLabel}} instance over multiple writer runs.
> * In an internal bug at Cray we're seeing a situation where different code 
> paths lead to different presentation of blank nodes and we have no APIs to 
> allow us to control this presentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
