[ 
https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082309#comment-13082309
 ] 

Andy Seaborne commented on JENA-85:
-----------------------------------

BindingIO doc updated.  Code updated.

For now, I've put in an encoding for labels (which is taken from N-Triples).

The first letter of the label is "B" (this ensures a letter is first).
Any character outside A-Za-z0-9 is encoded as Xnn, where nn is the byte value.
X is encoded as XX.

The Unicode implications of this need properly sorting out but it will work for 
all Jena-allocated blank nodes for now.
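
To make the rule above concrete, here is a minimal sketch of it in Java.  It is 
not the code committed to Jena: the class and method names are hypothetical, nn 
is assumed to be two upper-case hex digits, and any character that does not fit 
in one byte is rejected, since the Unicode handling is still open.

```java
// Hypothetical sketch of the label encoding described above; not the Jena code.
class BlankNodeLabelCodec {

    // Encode an internal blank node label into a safe label.
    static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");   // leading "B": a letter is always first
        for (int i = 0; i < label.length(); i++) {
            char ch = label.charAt(i);
            if (ch == 'X') {
                sb.append("XX");                     // X is the escape character, so double it
            } else if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9')) {
                sb.append(ch);                       // safe characters pass through unchanged
            } else if (ch <= 0xFF) {
                // Assumption: nn is the byte value written as two upper-case hex digits.
                sb.append('X').append(String.format("%02X", (int) ch));
            } else {
                throw new IllegalArgumentException("Character does not fit in one byte: " + ch);
            }
        }
        return sb.toString();
    }

    // Reverse of encode().
    static String decode(String encoded) {
        if (encoded.isEmpty() || encoded.charAt(0) != 'B')
            throw new IllegalArgumentException("Encoded label must start with B");
        StringBuilder sb = new StringBuilder();
        int i = 1;
        while (i < encoded.length()) {
            char ch = encoded.charAt(i);
            if (ch != 'X') {                         // ordinary character
                sb.append(ch);
                i++;
            } else if (i + 1 < encoded.length() && encoded.charAt(i + 1) == 'X') {
                sb.append('X');                      // XX decodes to a literal X
                i += 2;
            } else {
                if (i + 2 >= encoded.length())
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0)
                    throw new IllegalArgumentException("Bad escape at index " + i);
                sb.append((char) (hi * 16 + lo));    // Xnn decodes to the byte value nn
                i += 3;
            }
        }
        return sb.toString();
    }
}
```

With these assumptions, encode("-1a2b:X") gives "BX2D1a2bX3AXX", and decode 
reverses it.  Decoding is unambiguous because a hex digit is never "X".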


> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, 
> JENA-85-Blank-Node-Test.patch, JENA-85-DecodeBlankNodeLabels.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
> There are a number of activities that require being able to serialize, and 
> read back, bindings.  They use different serializations.  A shared "bindings 
> I/O" would mean all activities could use one, tuned, set of serialization and 
> I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array.  
> The byte array uses length-denoted byte arrays within the bindings.  I/O is 
> done using Data(In|Out)putStream, specifically putInt/getInt() and 
> put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[]) for the 
> per-row serialization as (var, Turtle string form) pairs.  It uses a null for 
> no such value.
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation 
> based on a binding encoded as (var, Turtle term).  End of row is denoted by a 
> DOT.  It uses modified RIOT for input reading.
> There is also use of TSV I/O for writing and reading result sets.  In this 
> form, the variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in 
> force.  Position in the row determines which variable is bound to which 
> value (=> compression of variable names).  Turtle-style prefixes can be 
> used (=> compression for IRIs) and the value of a slot in a row can be "same 
> as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness 
> against truncated data and bugs.
> Every row is the same length, in number of terms, as the list of variables 
> in force.
> Directives are lines starting with a keyword and ending in a DOT.
> The directives are:
>   PREFIX : <http://example> .
>   Like Turtle, except keyword based to fit with being a keyword-driven 
> mini-language.
>   VARS ?x ?y .
>   Set the variables in force for subsequent rows,
>   until the next VARS directive.
>   We need VARS because it's not always possible to determine all
>   the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed 
> names and short forms for numbers (more compression).  In addition STAR ("*") 
> means "same term as the row before" and DASH ("-") means undef.  Don't use * 
> to repeat a - from the previous row.
> Rows end in DOT. Preferred style is one space after each term.  This makes 
> writing safe.
> Terms can be written without intermediate copies (except local name 
> processing) or buffers.  The OutputLangUtils does not do this currently but 
> it should.
> For presentation reasons only, blank lines are allowed (this would all get 
> lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
> == Discussion
> The format is text, but we're writing strings anyway, so a binary form, 
> rather than a delimited text form, is unlikely to give much advantage, and it 
> can't reuse the standard bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway.  A binary tokenizer and 
> binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done. 
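
The row compression in the quoted proposal ("*" for "same term as the row 
before", "-" for undef, DOT to end a row) can be sketched as below.  This is an 
illustration only: BindingRowWriter is a hypothetical name, not a Jena class.  
Terms are assumed to arrive already in Turtle form, null stands for an unbound 
variable, and "same term" is approximated by string equality.  PREFIX 
directives are left out.

```java
import java.util.List;

// Hypothetical sketch of the proposed row format; not the Jena implementation.
class BindingRowWriter {

    // vars: variable names without the leading '?'.
    // rows: one list per row, aligned with vars; null means the variable is unbound.
    static String write(List<String> vars, List<List<String>> rows) {
        StringBuilder out = new StringBuilder("VARS");
        for (String v : vars)
            out.append(" ?").append(v);
        out.append(" .\n");                          // directives end in DOT

        List<String> prev = null;                    // the row before, if any
        for (List<String> row : rows) {
            if (row.size() != vars.size())
                throw new IllegalArgumentException("Row length must match the variables in force");
            for (int i = 0; i < row.size(); i++) {
                String term = row.get(i);
                if (term == null)
                    out.append('-');                 // undef, never written as "*"
                else if (prev != null && term.equals(prev.get(i)))
                    out.append('*');                 // same term as the row before
                else
                    out.append(term);
                out.append(' ');                     // one space after each term
            }
            out.append(".\n");                       // rows end in DOT
            prev = row;
        }
        return out.toString();
    }
}
```

Given variables x and y and the three rows (:local1, <http://example.other/text>), 
(:local1, undef) and (:local1, 123), this produces the VARS line and the three 
rows of the example in the description.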

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
