tdbloader2: Java external sorting using binary files vs. UNIX sort over text 
files
----------------------------------------------------------------------------------

                 Key: JENA-117
                 URL: https://issues.apache.org/jira/browse/JENA-117
             Project: Jena
          Issue Type: Improvement
          Components: TDB
            Reporter: Paolo Castagna


tdbloader2 could probably be made significantly faster by replacing the UNIX 
sort over text files with a pure Java external sorting implementation.
Since JENA-99 we have a SortedDataBag which does exactly that.

    ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
    SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
    Comparator<Tuple<Long>> comparator = new TupleComparator();
    SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);

TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are 
wrappers around DataInputStream|DataOutputStream. TupleComparator is trivial.
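
As a rough illustration of what such wrappers and such a comparator might do, 
here is a minimal sketch. It is not the code from the patch: a tuple is 
modelled as a plain long[] instead of Tuple<Long>, and the class name, the 
method names and the on-disk layout (an int width followed by the values) are 
all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in: tuples are long[] here, not TDB's Tuple<Long>.
class LongTupleCodecSketch {

    // Lexicographic comparison, slot by slot, on primitive long values.
    static final Comparator<long[]> COMPARATOR = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    // Assumed layout: the tuple width as an int, then each value as 8 bytes.
    static void write(DataOutputStream out, long[] tuple) throws IOException {
        out.writeInt(tuple.length);
        for (long v : tuple) out.writeLong(v);
    }

    static long[] read(DataInputStream in) throws IOException {
        long[] tuple = new long[in.readInt()];
        for (int i = 0; i < tuple.length; i++) tuple[i] = in.readLong();
        return tuple;
    }

    // Serializes a tuple to bytes and reads it back, to show the two halves agree.
    static long[] roundTrip(long[] tuple) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                write(out, tuple);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new long[] {1L, 2L, 3L})));
        System.out.println(COMPARATOR.compare(new long[] {1, 2, 3}, new long[] {1, 2, 4}) < 0);
    }
}
```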

Preliminary results seem promising and show that the Java implementation can 
be faster than UNIX sort, since it uses smaller binary files (instead of text 
files) and compares long values rather than strings.
An example, ExternalSort, which compares SortedDataBag with UNIX sort is 
available here:
https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
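
To make the size argument concrete, here is a small back-of-the-envelope 
sketch. The text layout it uses (one line per tuple, three 16-digit hex 
columns separated by spaces) is an assumption chosen for illustration, not 
necessarily the exact format tdbloader2 writes; the class and method names 
are hypothetical too.

```java
import java.nio.charset.StandardCharsets;

class TupleSizeSketch {

    // Binary form: each slot is a fixed 8-byte long.
    static int binaryBytes(int width) {
        return width * Long.BYTES;
    }

    // ASSUMED text form: zero-padded hex columns, space separated, newline terminated.
    static String textLine(long... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%016X", values[i]));
        }
        return sb.append('\n').toString();
    }

    static int textBytes(long... values) {
        return textLine(values).getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        System.out.println(binaryBytes(3));          // 24
        System.out.println(textBytes(1L, 2L, 3L));   // 51
        // Unpadded decimal text does not even sort in numeric order,
        // which is why a text sort needs padded (longer) keys.
        System.out.println(Long.compare(9L, 10L) < 0);    // true
        System.out.println("9".compareTo("10") < 0);      // false
    }
}
```

Under that assumed layout a triple takes roughly twice the bytes as text, and 
every comparison walks a character string instead of comparing up to three 
longs.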

A further advantage of doing the sorting in Java rather than with UNIX sort is 
that we could stream the sorted results directly into the BPlusTreeRewriter, 
rather than writing them to disk and then reading them back from disk into 
the BPlusTreeRewriter.
I've not yet run an experiment to see whether this is actually a significant 
improvement.
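
The streaming idea can be sketched as a k-way merge that exposes its output as 
an Iterator, so a consumer pulls sorted tuples one at a time and no final 
sorted file is written. This is only an illustration: tuples are modelled as 
long[], the runs are in-memory lists standing in for spill files, and 
MergingIteratorSketch is a hypothetical name. It does not show how 
SortedDataBag or BPlusTreeRewriter are actually wired together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Merges already-sorted runs lazily; nothing is materialised up front.
class MergingIteratorSketch implements Iterator<long[]> {

    static final Comparator<long[]> BY_SLOTS = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    // The current head of one run, plus the remainder of that run.
    private static final class Head {
        final long[] tuple;
        final Iterator<long[]> rest;
        Head(long[] tuple, Iterator<long[]> rest) {
            this.tuple = tuple;
            this.rest = rest;
        }
    }

    private final PriorityQueue<Head> heap =
            new PriorityQueue<>((x, y) -> BY_SLOTS.compare(x.tuple, y.tuple));

    MergingIteratorSketch(List<Iterator<long[]>> sortedRuns) {
        for (Iterator<long[]> run : sortedRuns) {
            if (run.hasNext()) heap.add(new Head(run.next(), run));
        }
    }

    @Override
    public boolean hasNext() {
        return !heap.isEmpty();
    }

    @Override
    public long[] next() {
        Head smallest = heap.poll();
        if (smallest == null) throw new NoSuchElementException();
        // Refill the heap from the run that just gave up its head.
        if (smallest.rest.hasNext()) {
            heap.add(new Head(smallest.rest.next(), smallest.rest));
        }
        return smallest.tuple;
    }

    // Test helper: drains the merge into a list. A real consumer would pull lazily.
    static List<long[]> drain(List<List<long[]>> runs) {
        List<Iterator<long[]>> iterators = new ArrayList<>();
        for (List<long[]> run : runs) iterators.add(run.iterator());
        List<long[]> out = new ArrayList<>();
        new MergingIteratorSketch(iterators).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        List<long[]> runA = new ArrayList<>();
        runA.add(new long[] {1, 1, 1});
        runA.add(new long[] {3, 0, 0});
        List<long[]> runB = new ArrayList<>();
        runB.add(new long[] {2, 5, 5});
        List<List<long[]>> runs = new ArrayList<>();
        runs.add(runA);
        runs.add(runB);
        for (long[] t : drain(runs)) System.out.println(Arrays.toString(t));
    }
}
```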

Using compression for intermediate files might help, but more experiments are 
necessary to establish if it is worthwhile or not.
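
One way to try this out would be to wrap the intermediate streams in GZIP 
streams from java.util.zip and compare sizes and timings with and without 
them. The sketch below only shows the wrapping and a size comparison on 
synthetic data. CompressedRunSketch and its methods are hypothetical names, 
GZIP is just one candidate codec, and the numbers it prints say nothing about 
real tdbloader2 workloads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedRunSketch {

    // Writes the values as 8-byte longs, optionally through a GZIP stream.
    static byte[] encode(long[] values, boolean gzip) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
            // Closing the DataOutputStream also closes (and finishes) the GZIP stream.
            try (DataOutputStream out = new DataOutputStream(sink)) {
                for (long v : values) out.writeLong(v);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back 'count' longs, optionally through a GZIP stream.
    static long[] decode(byte[] data, int count, boolean gzip) {
        try {
            InputStream source = new ByteArrayInputStream(data);
            if (gzip) source = new GZIPInputStream(source);
            try (DataInputStream in = new DataInputStream(source)) {
                long[] values = new long[count];
                for (int i = 0; i < count; i++) values[i] = in.readLong();
                return values;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Synthetic sorted run: consecutive values, chosen only for the demo.
        long[] run = new long[30000];
        for (int i = 0; i < run.length; i++) run[i] = i;
        System.out.println("raw bytes:  " + encode(run, false).length);
        System.out.println("gzip bytes: " + encode(run, true).length);
    }
}
```

Whether the smaller files pay for the extra CPU time is exactly what the 
experiments would need to show.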

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
