Looks like tdbloader2 is currently using a nice non-Unicode locale, so that's covered.
https://github.com/apache/jena/blob/master/apache-jena/bin/tdbloader2index#L146

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <[email protected]> wrote:
>
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and doesn't have fixed costs - until recently there was even a bug
> on some platforms where a particular string caused an infinite loop. (But
> that might have been for floating point :-)
>
> Byte-by-byte comparison without Unicode should be fairly fast, but it's worth
> checking if "sort" is a slowdown (I didn't think it was the slowest bit of
> tdbloader2)
>
> On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:
>
>> I've been tinkering with tdbloader2 and optimizing a use of it, and now
>> I've got a question about the intermediate files created (data-quads.tmp
>> and data-triples.tmp).
>>
>> They contain hexadecimal numbers that represent node IDs in tuples: the
>> contents of the database to be built, in tabular form. The TDB node IDs are
>> 64-bit integers, if I remember correctly, and as I say, they are
>> represented in these files as long hex strings. These data files are sorted
>> before being packed into indexes, and that sort occurs by using plain old
>> POSIX `sort`.
>>
>> If `sort` is the tool to be used (or at least the default, since it can be
>> aliased out if appropriate) wouldn't it make more sense for those IDs to be
>> decimal-radix integers, so that numeric comparison (which is often faster
>> because it avoids locale machinery; even the 'C' locale has some work involved)
>> could be used in `sort`? To my knowledge, most `sort`s out there won't
>> handle hex with numeric comparison, only string comparison.
>>
>> Or am I (as I often am) missing something about how those numbers get
>> used? Obviously, decimal versions of the IDs would be darn big numbers and
>> less readable, but if that is the concern, would there be any objection to
>> providing a switch on the utility to choose which radix and comparison
>> function to use?
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
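
A minimal sketch of the byte-order question raised in the thread, not taken from tdbloader2 itself. It assumes the node IDs are written as fixed-width, zero-padded, uppercase hex strings (16 digits for a 64-bit ID); that encoding and the helper name `to_hex` are assumptions for illustration and have not been checked against the actual data-triples.tmp format. Under that assumption, plain byte-wise string ordering (what `sort` does in the C locale) should agree with numeric ordering, so no numeric comparison is needed.

```python
import random


def to_hex(node_id: int) -> str:
    # Hypothetical encoding: 16 zero-padded uppercase hex digits per 64-bit ID.
    return format(node_id, "016X")


def bytewise_sorted(ids):
    # Compare the hex strings as raw ASCII bytes, like a C-locale string sort.
    return sorted((to_hex(i) for i in ids), key=lambda s: s.encode("ascii"))


def numeric_sorted(ids):
    # Sort by integer value first, then encode.
    return [to_hex(i) for i in sorted(ids)]


random.seed(0)
sample = [random.getrandbits(64) for _ in range(1000)] + [0, 1, 15, 16, 2**64 - 1]

# With fixed-width padding, both orderings coincide.
same_order = bytewise_sorted(sample) == numeric_sorted(sample)
print(same_order)

# Without padding they diverge: 0x10 (16) sorts before 0xA (10) as a string.
unpadded = sorted(format(i, "X") for i in (10, 16))
print(unpadded)
```

If the hex strings were not padded to a fixed width, the second check shows why string comparison alone would give the wrong order, which is the case where a decimal radix plus numeric comparison would matter.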
