tdbloader2: Java external sorting using binary files vs. UNIX sort over text
files
----------------------------------------------------------------------------------
Key: JENA-117
URL: https://issues.apache.org/jira/browse/JENA-117
Project: Jena
Issue Type: Improvement
Components: TDB
Reporter: Paolo Castagna
Replacing the UNIX sort over text files with a pure Java external sort would
probably give tdbloader2 a significant performance improvement.
Since JENA-99 we now have a SortedDataBag which does exactly that.
    ThresholdPolicyCount<Tuple<Long>> policy =
        new ThresholdPolicyCount<Tuple<Long>>(1000000);
    SerializationFactory<Tuple<Long>> serializerFactory =
        new TupleSerializationFactory();
    Comparator<Tuple<Long>> comparator = new TupleComparator();
    SortedDataBag<Tuple<Long>> sortedDataBag =
        new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are
wrappers around DataInputStream|DataOutputStream. TupleComparator is trivial.
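To make the two helpers concrete, here is a minimal sketch of what such a
comparator and DataInputStream|DataOutputStream based serialization could look
like. It is not the code from the patch: the class name LongTupleSketch is
hypothetical, and a plain long[] stands in for Tuple<Long> so that the example
runs without Jena on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;

// Hypothetical stand-in: long[] plays the role of Tuple<Long>.
class LongTupleSketch {

    // Lexicographic comparison, slot by slot, on the raw long values.
    static final Comparator<long[]> COMPARATOR = new Comparator<long[]>() {
        @Override
        public int compare(long[] a, long[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int c = Long.compare(a[i], b[i]);
                if (c != 0) return c;
            }
            return Integer.compare(a.length, b.length);
        }
    };

    // Writes the tuple width, then each slot as 8 raw bytes.
    static void write(DataOutputStream out, long[] tuple) throws IOException {
        out.writeInt(tuple.length);
        for (long v : tuple) out.writeLong(v);
    }

    // Reads back one tuple written by write().
    static long[] read(DataInputStream in) throws IOException {
        int n = in.readInt();
        long[] tuple = new long[n];
        for (int i = 0; i < n; i++) tuple[i] = in.readLong();
        return tuple;
    }

    // Convenience for trying the pair out in memory.
    static long[] roundTrip(long[] tuple) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            write(out, tuple);
            out.flush();
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The width prefix is a choice made for this sketch only; if every tuple in a
file has the same width, it could be dropped.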
Preliminary results seem promising and show that the Java implementation can
be faster than UNIX sort, since it uses smaller binary files (instead of text
files) and compares long values rather than strings.
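As a rough illustration of the size argument, the sketch below encodes one
tuple of three longs both ways and reports the byte counts. The text layout
(fixed-width hexadecimal, space separated, newline terminated) is an assumption
made for the example and may not match the exact format of the text files fed
to UNIX sort. EncodingSizeSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {

    // Size of a tuple written as raw 8-byte longs.
    static int binarySize(long[] tuple) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (long v : tuple) out.writeLong(v);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size of the same tuple as one line of text (assumed layout:
    // 16 hex digits per slot, single spaces between slots, trailing newline).
    static int textSize(long[] tuple) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tuple.length; i++) {
            if (i > 0) line.append(' ');
            line.append(String.format("%016X", tuple[i]));
        }
        line.append('\n');
        return line.toString().getBytes(StandardCharsets.US_ASCII).length;
    }
}
```

Under these assumptions a three-slot tuple takes 24 bytes in binary form and
51 bytes as text, so roughly half the I/O per tuple before any difference in
comparison cost is taken into account.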
An example, ExternalSort, which compares SortedDataBag with UNIX sort, is
available here:
https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
A further advantage of doing the sorting in Java rather than with UNIX sort is
that we could stream the sorted results directly into the BPlusTreeRewriter,
rather than writing them to disk and then reading them back from disk into the
BPlusTreeRewriter.
I've not done an experiment yet to see if this is actually a significant
improvement.
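The streaming idea can be sketched as an external merge sort that hands each
tuple to a consumer as soon as the merge produces it, so the fully sorted
output is never written to disk. This is a simplified illustration, not
SortedDataBag itself: ExternalSortSketch is a hypothetical name, long[] stands
in for Tuple<Long>, and the Consumer stands in for whatever would feed the
BPlusTreeRewriter.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

class ExternalSortSketch {

    // Slot-by-slot comparison of fixed-width tuples.
    private static final Comparator<long[]> CMP = (a, b) -> {
        for (int i = 0; i < a.length; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    // One sorted run on disk, with its current head tuple held in memory.
    private static final class Run {
        final DataInputStream in;
        long[] head;

        Run(File f) throws IOException {
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
        }

        // Loads the next tuple; returns false (and closes the file) at end of run.
        boolean advance(int width) throws IOException {
            try {
                long[] t = new long[width];
                for (int i = 0; i < width; i++) t[i] = in.readLong();
                head = t;
                return true;
            } catch (EOFException e) {
                in.close();
                return false;
            }
        }
    }

    // Sorts the input, spilling a run every `threshold` tuples, and streams
    // the merged result into `sink`.
    static void sort(Iterator<long[]> input, int width, int threshold, Consumer<long[]> sink) {
        try {
            List<File> spills = new ArrayList<>();
            List<long[]> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= threshold) {
                    spills.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) spills.add(spill(buffer));

            // k-way merge: the heap is ordered by each run's head tuple.
            PriorityQueue<Run> heap = new PriorityQueue<>((x, y) -> CMP.compare(x.head, y.head));
            for (File f : spills) {
                Run r = new Run(f);
                if (r.advance(width)) heap.add(r);
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                sink.accept(r.head);
                if (r.advance(width)) heap.add(r);
            }
            for (File f : spills) f.delete();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts the in-memory buffer and writes it as one binary run file.
    private static File spill(List<long[]> buffer) throws IOException {
        buffer.sort(CMP);
        File f = File.createTempFile("run", ".bin");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (long[] t : buffer) for (long v : t) out.writeLong(v);
        }
        return f;
    }
}
```

Only the intermediate runs touch the disk here; whether skipping the final
write and re-read is a measurable win is exactly the open question above.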
Using compression for intermediate files might help, but more experiments are
necessary to establish if it is worthwhile or not.
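One way to run such an experiment is to wrap the intermediate streams in
GZIPOutputStream|GZIPInputStream and compare sizes and timings. The sketch
below does this in memory; CompressedRunSketch is a hypothetical name, long[]
again stands in for Tuple<Long>, and gzip is only one candidate codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedRunSketch {

    // Encodes tuples as raw longs, optionally through gzip.
    static byte[] encode(List<long[]> tuples, boolean compress) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            OutputStream raw = compress ? new GZIPOutputStream(bytes) : bytes;
            // Closing the DataOutputStream also finishes the gzip stream.
            try (DataOutputStream out = new DataOutputStream(raw)) {
                for (long[] t : tuples) for (long v : t) out.writeLong(v);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes fixed-width tuples written by encode().
    static List<long[]> decode(byte[] data, int width, boolean compressed) {
        try {
            InputStream raw = new ByteArrayInputStream(data);
            if (compressed) raw = new GZIPInputStream(raw);
            List<long[]> tuples = new ArrayList<>();
            try (DataInputStream in = new DataInputStream(raw)) {
                while (true) {
                    long[] t = new long[width];
                    try {
                        t[0] = in.readLong();
                    } catch (EOFException e) {
                        break; // clean end of stream at a tuple boundary
                    }
                    for (int i = 1; i < width; i++) t[i] = in.readLong();
                    tuples.add(t);
                }
            }
            return tuples;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Comparing encode(tuples, true).length with encode(tuples, false).length gives
the space saving for a given data set; the CPU cost of compressing and
decompressing has to be measured separately against the reduced I/O.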
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira