Parser using Platform encoding instead of UTF-8

Rupert Westenthaler (Updated) (JIRA) Thu, 13 Oct 2011 14:29:38 -0700

     [ 
https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rupert Westenthaler updated CLEREZZA-643:
-----------------------------------------

    Attachment: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch

To solve this I created an alternative implementation that

* copies the Triples to an Array
* uses Arrays.sort with a comparator based on the subject to sort the triples
* iterates over the triples until the subject changes while storing 
predicate/object values in an intermediate map
* directly writes the JSON data for each subject to a buffered writer. It does 
NOT create the JSON objects for all subjects of the serialized TripleCollection
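
The steps above can be sketched roughly as follows. This is not the code from 
the attached patch: the Triple class is a hypothetical stand-in that uses plain 
strings instead of Clerezza resources, and each object is written as a bare JSON 
string rather than the value/type structure that json+rdf uses.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortBasedSerializerSketch {

    /** Hypothetical stand-in for a triple; plain strings keep the sketch self-contained. */
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    static void serialize(Collection<Triple> triples, Writer out) throws IOException {
        // Step 1: copy the triples to an array.
        Triple[] sorted = triples.toArray(new Triple[0]);
        // Step 2: sort by subject so all triples of one subject are adjacent.
        Arrays.sort(sorted, Comparator.comparing((Triple t) -> t.subject));

        BufferedWriter writer = new BufferedWriter(out);
        writer.write("{");
        // Intermediate map: predicate -> objects of the subject currently processed.
        Map<String, List<String>> values = new LinkedHashMap<>();
        String current = null;
        boolean first = true;
        for (Triple t : sorted) {
            // Step 3: when the subject changes, write the finished subject and reset the map.
            if (current != null && !current.equals(t.subject)) {
                writeSubject(writer, current, values, first);
                first = false;
                values.clear();
            }
            current = t.subject;
            values.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
        }
        if (current != null) {
            writeSubject(writer, current, values, first);
        }
        writer.write("}");
        writer.flush();
    }

    // Step 4: write the JSON for one subject directly to the writer.
    private static void writeSubject(Writer w, String subject,
            Map<String, List<String>> values, boolean first) throws IOException {
        if (!first) w.write(",");
        w.write(quote(subject));
        w.write(":{");
        boolean firstPredicate = true;
        for (Map.Entry<String, List<String>> e : values.entrySet()) {
            if (!firstPredicate) w.write(",");
            firstPredicate = false;
            w.write(quote(e.getKey()));
            w.write(":[");
            List<String> objects = e.getValue();
            for (int i = 0; i < objects.size(); i++) {
                if (i > 0) w.write(",");
                w.write(quote(objects.get(i)));
            }
            w.write("]");
        }
        w.write("}");
    }

    // Minimal escaping (backslash and double quote only); enough for the sketch.
    private static String quote(String s) {
        return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
    }

    /** Convenience wrapper that serializes into a String. */
    static String toJson(Collection<Triple> triples) {
        StringWriter out = new StringWriter();
        try {
            serialize(triples, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Triple> triples = Arrays.asList(
                new Triple("s2", "p1", "o1"),
                new Triple("s1", "p1", "a"),
                new Triple("s1", "p2", "b"),
                new Triple("s1", "p1", "c"));
        // prints {"s1":{"p1":["a","c"],"p2":["b"]},"s2":{"p1":["o1"]}}
        System.out.println(toJson(triples));
    }
}
```

Only the values of one subject are held in the intermediate map at any time, 
so memory use beyond the sorted array does not grow with the number of subjects.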

This implementation serializes a graph with 100k triples in about 1 second on 
my machine.
The source also includes a lot of comments about different approaches. I kept 
these comments mainly to document the approaches I tried during testing and 
optimizing.

I also implemented a method (RdfJsonSerializerProviderTest#testBigGraph()) that 
can create an RDF graph (mix of URIs, bNodes, TypedLiterals and PlainLiterals) 
that can be used for testing. Currently the generated graph is serialized 10 
times to get rid of JIT compilation side effects. 
Currently the @Test annotation of this test is disabled because it is intended 
more to test the performance implications of different implementations 
than to test the validity of the generated json+rdf.
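
One common way to reduce JIT side effects in such a test is to run the task a 
few times unmeasured before timing it. The sketch below only illustrates that 
idea; the class and method names are hypothetical and the workload is a 
placeholder, not the serializer test from the patch.

```java
class WarmupTimingSketch {

    /**
     * Runs the task warmupRuns times without measuring, then returns the
     * duration of one further, measured run in nanoseconds.
     */
    static long timeAfterWarmup(Runnable task, int warmupRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // give the JIT a chance to compile the hot code paths
        }
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        // Placeholder workload; a real test would serialize the generated graph here.
        Runnable task = () -> {
            sink.setLength(0);
            for (int i = 0; i < 100_000; i++) {
                sink.append(i);
            }
        };
        long nanos = timeAfterWarmup(task, 10);
        System.out.println("measured run took " + (nanos / 1_000_000.0) + " ms");
    }
}
```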

Two final notes: 

* Sorting the triples of the parsed collection is only the second best way. It 
would be even better if one could get a sorted iterator directly from a triple 
collection. For example, Jena TDB by default provides an iterator based on the 
SPO index that happens to be sorted by subject.
* The Apache Stanbol JSON-LD serializer referenced by CLEREZZA-642 also suffers 
from problems similar to those of the current JSON+RDF serializer. 

                
> Weak Performance of "application/json+rdf" serializer on big 
> TripleCollections and Serialzer/Parser using Platform encoding instead of 
> UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLEREZZA-643
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-643
>             Project: Clerezza
>          Issue Type: Improvement
>            Reporter: Rupert Westenthaler
>         Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform specific 
> encodings instead of UTF-8.
> In addition the serializer suffers from very poor performance on big graphs 
> (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is because 
> of the use of multiple TripleCollection.filter(..) calls, first to filter all 
> predicates for a subject and then all objects for each subject/predicate 
> combination. Trying to serialize a graph with 50k triples ended in several 
> minutes of 100% CPU.
> With the next comment I will provide a patch with an implementation based on 
> a sorted array of the triples. With this method one can serialize graphs with 
> 100k triples in about 1 second. This patch also changes the encoding to UTF-8.
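
For illustration, the encoding part of the issue comes down to naming the 
charset explicitly when wrapping streams, instead of relying on the platform 
default. A minimal sketch, assuming plain java.io streams; the class and method 
names are hypothetical and this is not the code from the patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8IoSketch {

    /** Encodes text as UTF-8, independent of the platform default charset. */
    static byte[] write(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // new OutputStreamWriter(bytes) without a charset would use the platform default.
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes UTF-8 bytes, independent of the platform default charset. */
    static String read(byte[] data) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Z\u00fcrich"; // contains a non-ASCII character
        byte[] encoded = write(text);
        System.out.println(encoded.length);             // prints 7
        System.out.println(read(encoded).equals(text)); // prints true
    }
}
```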

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
