[ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paolo Castagna updated JENA-117:
--------------------------------

    Summary: A pure Java version of tdbloader2, a.k.a. tdbloader3  (was: A pure Java version of tdbloader2)

> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in
> replacing the UNIX sort over text files with a pure Java external sorting
> implementation.
> Since JENA-99 we have a SortedDataBag which does exactly that.
>
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
>
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort, since it uses smaller binary files (instead of text
> files) and compares long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is
> that we could stream results directly into the BPlusTreeRewriter, rather than
> writing them to disk and then reading them back from disk into the
> BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant
> improvement.
> Using compression for intermediate files might help, but more experiments are
> necessary to establish if it is worthwhile or not.
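
For readers unfamiliar with the technique, the external sort over binary tuples described in the issue can be sketched with the JDK alone. This is not the SortedDataBag code from JENA-99. The class name TupleExternalSortSketch, the fixed tuple width of 3, the Head helper and the spill threshold parameter are all assumptions made for illustration. The sketch buffers tuples in memory, spills a sorted run to a binary temp file through DataOutputStream whenever the threshold is reached, and then merges the runs with a priority queue, comparing long values rather than strings.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, NOT Jena's SortedDataBag: external merge sort of
// fixed-width long tuples using binary spill files.
final class TupleExternalSortSketch {
    static final int WIDTH = 3; // assumed tuple width (e.g. three node ids)

    // Lexicographic comparison of long values, component by component.
    static final Comparator<long[]> COMPARATOR = (a, b) -> {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    // Head of one sorted run during the merge: current tuple plus its stream.
    static final class Head {
        final long[] tuple;
        final DataInputStream in;
        Head(long[] tuple, DataInputStream in) { this.tuple = tuple; this.in = in; }
    }

    // Sorts the input, spilling a sorted run to disk every `threshold` tuples.
    static List<long[]> sort(Iterator<long[]> input, int threshold) throws IOException {
        List<File> runs = new ArrayList<>();
        List<long[]> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= threshold) {
                runs.add(spill(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) runs.add(spill(buffer));
        return merge(runs);
    }

    // Sorts the in-memory buffer and writes it as raw longs to a temp file.
    static File spill(List<long[]> buffer) throws IOException {
        buffer.sort(COMPARATOR);
        File run = File.createTempFile("run", ".bin");
        run.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(run)))) {
            for (long[] tuple : buffer) {
                for (long value : tuple) out.writeLong(value);
            }
        }
        return run;
    }

    // Reads one tuple, or returns null at a clean end of file.
    static long[] read(DataInputStream in) throws IOException {
        long[] tuple = new long[WIDTH];
        try {
            tuple[0] = in.readLong();
        } catch (EOFException e) {
            return null;
        }
        for (int i = 1; i < WIDTH; i++) tuple[i] = in.readLong();
        return tuple;
    }

    // K-way merge of the sorted runs using a priority queue of run heads.
    static List<long[]> merge(List<File> runs) throws IOException {
        PriorityQueue<Head> heap =
                new PriorityQueue<>((x, y) -> COMPARATOR.compare(x.tuple, y.tuple));
        List<DataInputStream> streams = new ArrayList<>();
        List<long[]> sorted = new ArrayList<>();
        try {
            for (File run : runs) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(run)));
                streams.add(in);
                long[] first = read(in);
                if (first != null) heap.add(new Head(first, in));
            }
            while (!heap.isEmpty()) {
                Head head = heap.poll();
                sorted.add(head.tuple);
                long[] next = read(head.in);
                if (next != null) heap.add(new Head(next, head.in));
            }
        } finally {
            for (DataInputStream in : streams) in.close();
        }
        return sorted;
    }
}
```

For brevity the sketch collects the merged output in a list. To get the streaming benefit mentioned in the issue, the merge loop would instead hand each tuple to a consumer as it is polled from the queue, so that the sorted stream never has to be written back to disk.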