[
https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082309#comment-13082309
]
Andy Seaborne commented on JENA-85:
-----------------------------------
BindingIO doc updated. Code updated.
For now, I've put in an encoding for labels (which is taken from N-Triples).
The first letter of the label is "B" (this ensures a letter is first).
Any character outside A-Za-z0-9 is encoded as Xnn where nn is the byte value.
X is encoded as XX.
The Unicode implications of this need properly sorting out but it will work for
all Jena-allocated blank nodes for now.
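
As a rough illustration of the label encoding described above, here is a minimal Java sketch. The class name BNodeLabelCodec is hypothetical, and two details are assumptions that the comment does not state: "nn" is taken to be two uppercase hex digits, and the byte values are taken from the UTF-8 form of the label. It is not the code committed for JENA-85.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jena: encodes a blank node label so that
// the result starts with 'B' and contains only A-Za-z0-9.
final class BNodeLabelCodec {
    private BNodeLabelCodec() {}

    static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");          // a letter always comes first
        for (byte b : label.getBytes(StandardCharsets.UTF_8)) {   // assumption: UTF-8 bytes
            int v = b & 0xFF;
            if (v == 'X') {
                sb.append("XX");                             // the escape character itself
            } else if ((v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z') || (v >= '0' && v <= '9')) {
                sb.append((char) v);                         // safe characters pass through
            } else {
                sb.append('X').append(String.format("%02X", v));  // assumption: nn is hex
            }
        }
        return sb.toString();
    }

    static String decode(String encoded) {
        if (encoded.isEmpty() || encoded.charAt(0) != 'B') {
            throw new IllegalArgumentException("Label must start with 'B': " + encoded);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 1; i < encoded.length(); i++) {
            char ch = encoded.charAt(i);
            if (ch != 'X') {
                bytes.write(ch);
            } else if (i + 1 < encoded.length() && encoded.charAt(i + 1) == 'X') {
                bytes.write('X');                            // "XX" decodes to a literal X
                i++;
            } else {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape in: " + encoded);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape in: " + encoded);
                }
                bytes.write(hi * 16 + lo);
                i += 2;
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String enc = encode("a:b-X1");
        System.out.println(enc);            // prints BaX3AbX2DXX1
        System.out.println(decode(enc));    // prints a:b-X1
    }
}
```

Because X is not a hex digit, "XX" cannot be confused with an Xnn escape under the hex assumption, so decoding is unambiguous.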
> Common bindings I/O
> -------------------
>
> Key: JENA-85
> URL: https://issues.apache.org/jira/browse/JENA-85
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Paolo Castagna
> Attachments: JENA-85-BindingOutputStream-Changes.patch,
> JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
> There are a number of activities that require being able to serialize, and
> read back, bindings. They use different serializations. A shared "bindings
> I/O" would mean all activities could use one, tuned, set of serialization and
> I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array.
> The byte array uses length-denoted byte arrays within the bindings. I/O is
> done using Data(In|Out)putStream, specifically putInt/getInt() and
> put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[]) for the
> per-row serialization as (var,Turtle string form) pairs. It uses a null for
> no such value.
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation
> based on a binding encoded as (var, Turtle term). End of row is denoted by a
> DOT. It uses modified RIOT for input reading.
> There is also use of TSV I/O for writing and reading result sets. In this
> form, the variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in
> force. Position in the row determines which term is bound to which
> variable (=> compression of variable names). Turtle-style prefixes can be
> used (=> compression for IRIs) and the value of a slot in a row can be "same
> as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness
> against truncated data and bugs.
> Every row is the same length, in number of terms, as the list of variables in force.
> Directives are lines starting with a keyword. End on DOT.
> The directives are:
> PREFIX : <http://example> .
> Like Turtle, except keyword based to fit with being a keyword-driven
> mini-language.
> VARS ?x ?y .
> Set the variables in force for subsequent rows,
> until the next VARS directive.
> We need VARS because it's not always possible to determine all
> the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed
> names and short forms for numbers (more compression). In addition STAR ("*")
> means "same term as the row before" and DASH ("-") means undef. Don't use *
> to repeat a - from the previous row; write - again.
> Rows end in DOT. Preferred style is one space after each term. This makes
> writing safe.
> Terms can be written without intermediate copies (except local name
> processing) or buffers. The OutputLangUtils does not do this currently but
> it should.
> For presentation reasons only, blank lines are allowed (this would all get
> lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
> == Discussion
> The format is text - but we're writing strings anyway, so a binary form,
> rather than a delimited text form, is unlikely to give much advantage, and it
> can't reuse the standard bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway. A binary tokenizer and
> binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done.
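
To make the row compression in the quoted proposal concrete, here is a minimal Java sketch of the writing side. BindingRowWriter and its methods are hypothetical names, not ARQ classes. It assumes each term has already been serialized to its Turtle string form (prefix handling and PREFIX directives are left out), and that a row is a map from variable name to term string in which a missing key means undefined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not ARQ code: writes rows against the variables in force,
// using "*" for "same term as the row before" and "-" for undefined.
final class BindingRowWriter {
    private final List<String> vars;
    private final String[] previous;                 // previous row's terms; null = undefined
    private final StringBuilder out = new StringBuilder();

    BindingRowWriter(List<String> vars) {
        this.vars = new ArrayList<>(vars);
        this.previous = new String[vars.size()];
        out.append("VARS");
        for (String v : vars) {
            out.append(" ?").append(v);
        }
        out.append(" .\n");
    }

    // row maps variable name -> Turtle term string (assumed already serialized).
    void writeRow(Map<String, String> row) {
        for (int i = 0; i < vars.size(); i++) {
            String term = row.get(vars.get(i));
            if (term == null) {
                out.append("- ");                    // undefined is always "-", never "*"
            } else if (term.equals(previous[i])) {
                out.append("* ");                    // same term as the row before
            } else {
                out.append(term).append(' ');        // one space after each term
            }
            previous[i] = term;
        }
        out.append(".\n");                           // rows end in DOT
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        BindingRowWriter w = new BindingRowWriter(Arrays.asList("x", "y"));
        Map<String, String> row = new HashMap<>();
        row.put("x", ":local1");
        row.put("y", "<http://example.other/text>");
        w.writeRow(row);
        row.remove("y");
        w.writeRow(row);
        row.put("y", "123");
        w.writeRow(row);
        System.out.print(w.result());
    }
}
```

Run on the three rows from the example in the proposal, this sketch emits the VARS line followed by the same three data rows; the PREFIX line is absent because prefixes are not modelled here.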