[ 
https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Castagna updated JENA-117:
--------------------------------

    Summary: A pure Java version of tdbloader2, a.k.a. tdbloader3  (was: A pure Java version of tdbloader2)
    
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in 
> replacing the UNIX sort over text files with an external sorting pure Java 
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy =
>         new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory =
>         new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag =
>         new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
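
To make the serialization and comparison idea concrete, here is a minimal
sketch. It is not the code from the patch: a tuple is modelled as a plain
long[] instead of Tuple<Long>, and the class and method names (TupleSketch,
write, read, toBytes) are hypothetical. Only DataInputStream, DataOutputStream
and Comparator are real JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Comparator;

// Hypothetical stand-in for the TDB classes: a tuple is a plain long[].
class TupleSketch {
    // Compare tuples slot by slot on their long values (no string parsing).
    static final Comparator<long[]> COMPARATOR = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    // Fixed-width binary encoding: every slot takes exactly 8 bytes.
    static void write(DataOutputStream out, long[] tuple) throws IOException {
        for (long v : tuple) out.writeLong(v);
    }

    // Read back one tuple of the given width.
    static long[] read(DataInputStream in, int width) throws IOException {
        long[] tuple = new long[width];
        for (int i = 0; i < width; i++) tuple[i] = in.readLong();
        return tuple;
    }

    // Convenience helper: serialize one tuple into a byte array.
    static byte[] toBytes(long[] tuple) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out, tuple);
        }
        return bytes.toByteArray();
    }
}
```

With this encoding a triple always occupies 24 bytes, whatever the values are,
and ordering never has to parse text.
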
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort, since it uses smaller binary files (instead of text
> files) and it compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort
> is that we could stream the sorted results directly into the
> BPlusTreeRewriter, rather than writing them to disk and then reading them
> back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant 
> improvement.
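
The streaming idea can be sketched as a small external merge sort that spills
sorted runs to temporary binary files and then merges them straight into a
consumer. This is an illustration under assumptions, not SortedDataBag itself:
ExternalSortSketch, its methods and the Consumer<long[]> sink (standing in for
whatever feeds the BPlusTreeRewriter) are hypothetical names, and tuples are
again plain long[] values.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical sketch of an external sort whose output is streamed to a sink.
class ExternalSortSketch {
    static final Comparator<long[]> CMP = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    private final int width;      // slots per tuple
    private final int threshold;  // tuples held in memory before spilling
    private final List<long[]> memory = new ArrayList<>();
    private final List<File> runs = new ArrayList<>();

    ExternalSortSketch(int width, int threshold) {
        this.width = width;
        this.threshold = threshold;
    }

    void add(long[] tuple) throws IOException {
        memory.add(tuple);
        if (memory.size() >= threshold) spill();
    }

    // Sort the in-memory batch and write it as one binary run file.
    private void spill() throws IOException {
        memory.sort(CMP);
        File f = File.createTempFile("run", ".bin");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(memory.size());
            for (long[] t : memory) for (long v : t) out.writeLong(v);
        }
        runs.add(f);
        memory.clear();
    }

    // K-way merge of all runs; each tuple goes to the sink in sorted order
    // without the merged result ever being written back to disk.
    void drainTo(Consumer<long[]> sink) throws IOException {
        if (!memory.isEmpty()) spill();
        Comparator<Cursor> byHead = (x, y) -> CMP.compare(x.head, y.head);
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byHead);
        List<Cursor> cursors = new ArrayList<>();
        try {
            for (File f : runs) {
                Cursor c = new Cursor(new DataInputStream(
                        new BufferedInputStream(new FileInputStream(f))));
                cursors.add(c);
                if (c.advance(width)) heap.add(c);
            }
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                sink.accept(c.head);
                if (c.advance(width)) heap.add(c);
            }
        } finally {
            for (Cursor c : cursors) c.in.close();
            for (File f : runs) f.delete();
            runs.clear();
        }
    }

    // Reads one run file; 'head' is the next tuple not yet emitted.
    private static final class Cursor {
        final DataInputStream in;
        int remaining;
        long[] head;

        Cursor(DataInputStream in) throws IOException {
            this.in = in;
            this.remaining = in.readInt();
        }

        boolean advance(int width) throws IOException {
            if (remaining == 0) return false;
            head = new long[width];
            for (int i = 0; i < width; i++) head[i] = in.readLong();
            remaining--;
            return true;
        }
    }
}
```

A caller would add tuples one by one and then call drainTo with the component
that builds the index. Whether skipping the final write and re-read saves a
significant amount of time is exactly what still needs to be measured.
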
> Using compression for intermediate files might help, but more experiments are 
> necessary to establish if it is worthwhile or not.
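
One way to experiment with compressed intermediate files is to wrap the data
streams in the JDK's GZIP streams. The sketch below is only an assumption about
how such an experiment could look: CompressedRunSketch, writeRun and readRun
are hypothetical names, and byte arrays stand in for the temporary files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: the same binary run, written with or without GZIP.
class CompressedRunSketch {
    static byte[] writeRun(long[][] tuples, boolean compress) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream raw = compress ? new GZIPOutputStream(bytes) : bytes;
        // Closing the DataOutputStream also finishes the GZIP stream.
        try (DataOutputStream out = new DataOutputStream(raw)) {
            for (long[] t : tuples) for (long v : t) out.writeLong(v);
        }
        return bytes.toByteArray();
    }

    static long[][] readRun(byte[] data, int count, int width, boolean compressed)
            throws IOException {
        InputStream raw = new ByteArrayInputStream(data);
        if (compressed) raw = new GZIPInputStream(raw);
        try (DataInputStream in = new DataInputStream(raw)) {
            long[][] tuples = new long[count][width];
            for (int i = 0; i < count; i++)
                for (int j = 0; j < width; j++) tuples[i][j] = in.readLong();
            return tuples;
        }
    }
}
```

Comparing the length of the two byte arrays and the time taken by writeRun and
readRun on real node id data would show whether the smaller files outweigh the
extra CPU cost.
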

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
