[
https://issues.apache.org/jira/browse/JENA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001132#comment-14001132
]
Andy Seaborne commented on JENA-675:
------------------------------------
A first experiment, adding {{OutputPolicy}} directly to the
{{WriterGraphRIOT}}/{{WriterDatasetRIOT}} interfaces as a general capability for
writers, didn't work out very nicely. I found that I was seeing changes across
all output in RIOT, but much of it is not useful; I feel it places
responsibility for setup (often dictated by the standard for writing a single
complete graph) away from the language itself.
An alternative approach is to provide detailed output setup specifically for
writing fragments of graphs (e.g. optionally not repeating the prefixes,
carrying the {{NodeToLabel}} map across, not doing whole-graph pretty
formatting).
One example: writing a graph needs to write the prefixes, but writing a
fragment of a graph might not.
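As a rough illustration of the fragment idea (plain Java, no Jena dependency, and not RIOT code): the sketch below makes the preamble optional and carries one label map across several write calls. {{FragmentWriterSketch}}, {{LabelMap}} (a stand-in for a shared {{NodeToLabel}}), the {{writePreamble}} flag and the "bnode:" naming convention are all hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only: none of these types exist in Jena. */
public class FragmentWriterSketch {

    /** Minimal triple of already-formatted terms, e.g. "ex:s" or "bnode:x". */
    public static final class Triple {
        final String s, p, o;
        public Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Stand-in for a shared NodeToLabel: internal blank node id -> output label. */
    public static final class LabelMap {
        private final Map<String, String> labels = new LinkedHashMap<>();

        public String labelFor(String internalId) {
            String label = labels.get(internalId);
            if (label == null) {
                // Allocate labels in first-seen order: _:b0, _:b1, ...
                label = "_:b" + labels.size();
                labels.put(internalId, label);
            }
            return label;
        }
    }

    /** Writes one fragment; the preamble (prefixes) is emitted only if asked for. */
    public static String writeFragment(Map<String, String> prefixes, List<Triple> triples,
                                       LabelMap labels, boolean writePreamble) {
        StringBuilder out = new StringBuilder();
        if (writePreamble) {
            for (Map.Entry<String, String> e : prefixes.entrySet())
                out.append("@prefix ").append(e.getKey()).append(": <")
                   .append(e.getValue()).append("> .\n");
        }
        for (Triple t : triples) {
            out.append(term(t.s, labels)).append(' ')
               .append(term(t.p, labels)).append(' ')
               .append(term(t.o, labels)).append(" .\n");
        }
        return out.toString();
    }

    /** Sketch convention: terms starting with "bnode:" are blank nodes. */
    private static String term(String t, LabelMap labels) {
        return t.startsWith("bnode:") ? labels.labelFor(t) : t;
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("ex", "http://example/");
        LabelMap shared = new LabelMap();   // one map for all fragments

        // First fragment carries the preamble; later fragments do not repeat it.
        System.out.print(writeFragment(prefixes,
            Arrays.asList(new Triple("bnode:x", "ex:p", "ex:o1")), shared, true));
        System.out.print(writeFragment(prefixes,
            Arrays.asList(new Triple("bnode:y", "ex:p", "bnode:x")), shared, false));
    }
}
```

Because both calls share the same {{LabelMap}}, the blank node written in the first fragment keeps its label when it reappears in the second.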
There already are output formats that do fragment-based writing for whole
graphs: {{RDFFormat.TURTLE_FLAT}} and {{RDFFormat.TURTLE_BLOCKS}} (and the
similar TriG forms), using {{WriterStreamRDF???}} classes. They write the
preamble, then write fragments.
N-Quads and N-Triples use {{WriterStreamRDFTuples}}. In this case, there is no
preamble.
Would this be a reasonable place to build the output functions needed for
RDF-hadoop?
Related: instead of a {{Graph}} for buffering, what about a {{StreamRDF}}?
(cf. {{WriterStreamRDFBatched}} with a policy of same-subject, same-graph;
there is scope for lots of different buffering strategies here, including
size-based clumping, and common subjects within a clump).
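To make the buffering idea concrete, here is a minimal sketch of a same-subject batching policy over a stream of triples. It is plain Java and does not use or reproduce {{StreamRDF}} or {{WriterStreamRDFBatched}}; {{SubjectBatcherSketch}} and its methods are hypothetical names, and triples are just string arrays for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch only: groups consecutive triples that share a subject. */
public class SubjectBatcherSketch {
    private final Consumer<List<String[]>> onBatch;   // receives each completed batch
    private final List<String[]> batch = new ArrayList<>();
    private String currentSubject = null;

    public SubjectBatcherSketch(Consumer<List<String[]>> onBatch) {
        this.onBatch = onBatch;
    }

    /** Accepts one triple; a change of subject closes the current batch. */
    public void triple(String s, String p, String o) {
        if (currentSubject != null && !currentSubject.equals(s))
            flush();
        currentSubject = s;
        batch.add(new String[] { s, p, o });
    }

    /** Signals end of stream; emits whatever is still buffered. */
    public void finish() {
        flush();
    }

    private void flush() {
        if (batch.isEmpty())
            return;
        onBatch.accept(new ArrayList<>(batch));   // hand over a copy, then reuse the buffer
        batch.clear();
        currentSubject = null;
    }

    public static void main(String[] args) {
        SubjectBatcherSketch batcher = new SubjectBatcherSketch(
            b -> System.out.println(b.get(0)[0] + " : " + b.size() + " triple(s)"));
        batcher.triple("ex:s1", "ex:p", "ex:o1");
        batcher.triple("ex:s1", "ex:q", "ex:o2");
        batcher.triple("ex:s2", "ex:p", "ex:o3");
        batcher.triple("ex:s1", "ex:p", "ex:o4");   // s1 again, but not adjacent: new batch
        batcher.finish();
    }
}
```

Other policies (size-based clumping, or grouping common subjects within a clump) would only change when {{flush()}} is triggered; the consumer side stays the same.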
> Add and use a WriterProfile API
> -------------------------------
>
> Key: JENA-675
> URL: https://issues.apache.org/jira/browse/JENA-675
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ, RIOT
> Affects Versions: Jena 2.11.1
> Reporter: Rob Vesse
> Assignee: Andy Seaborne
>
> Currently we have a {{ParserProfile}} which allows specifying certain aspects
> of input behaviour such as Prologue and Label to Node ID.
> However, we don't have a corresponding {{WriterProfile}} API; we actually have
> a class called {{OutputProfile}} but this is never actually used anywhere.
> This would be particularly useful for languages that rely on the
> {{NodeFormatter}} API where we can find comments such as the following:
> {quote}
> // Replace with a single "OutputPolicy"
> {quote}
> The lack of this API means we don't provide users any ability to do things
> like control how blank node IDs are allocated. And existing functionality we
> do give them like providing a set of namespaces and base URI to use for
> serialisation needs to be folded into this API.
> I know of two places where this is currently causing issues:
> * In the incoming Hadoop RDF Tools code (JENA-666) many output formats
> currently mangle the data when outputting blank nodes because they can't
> share a {{NodeToLabel}} instance over multiple writer runs.
> * In an internal bug at Cray we're seeing a situation where different code
> paths lead to different presentation of blank nodes and we have no APIs to
> allow us to control this presentation.