Looks like tdbloader2 is currently using a nice non-Unicode locale, so that's 
covered.

https://github.com/apache/jena/blob/master/apache-jena/bin/tdbloader2index#L146
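
For what it's worth, here is a minimal Python sketch (not Jena code) of why plain byte comparison should be enough for these files. It assumes the IDs are written as zero-padded, fixed-width, single-case hex strings, which I haven't confirmed for the temp files, and the sample IDs are made up for illustration.

```python
import random

# Hypothetical sample of 64-bit node IDs (not taken from real TDB output).
random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]

# Assumption: each ID is written as a zero-padded, 16-digit, upper-case hex string.
hex_keys = [format(i, "016X") for i in ids]

# Ordering by raw ASCII bytes, which is how `sort` collates in the C locale.
bytewise = sorted(hex_keys, key=lambda s: s.encode("ascii"))

# Ordering by the numeric value each string encodes.
numeric = sorted(hex_keys, key=lambda s: int(s, 16))

# With fixed-width, single-case hex the two orderings coincide.
assert bytewise == numeric

# Without zero padding they can differ: "10" (16) sorts before "A" (10) bytewise.
unpadded = ["A", "10"]
assert sorted(unpadded) == ["10", "A"]
assert sorted(unpadded, key=lambda s: int(s, 16)) == ["A", "10"]
```

If the padding assumption holds, a byte-wise sort already gives numeric order, so there would be no ordering reason to switch radix.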

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <[email protected]> wrote:
> 
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and doesn't have fixed costs - until recently there was even a bug
> on some platforms where a particular string caused an infinite loop. (But
> that might have been for floating point :-)
> 
> Byte-by-byte comparison without Unicode should be fairly fast, but it's worth
> checking whether "sort" is a slowdown (I didn't think it was the slowest bit
> of tdbloader2)
> 
> On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:
> 
>> I've been tinkering with tdbloader2 and optimizing a use of it, and now
>> I've got a question about the intermediate files created (data-quads.tmp
>> and data-triples.tmp).
>> 
>> They contain hexadecimal numbers that represent node IDs in tuples; the
>> contents of the database to be built in tabular form. The TDB node IDs are
>> 64 bit integers, if I remember correctly, and as I say, they are
>> represented in these files as long hex strings. These data files are sorted
>> before being packed into indexes, and that sort occurs by using plain old
>> POSIX `sort`.
>> 
>> If `sort` is the tool to be used (or at least the default, since it can be
>> aliased out if appropriate) wouldn't it make more sense for those IDs to be
>> decimal-radix integers, so that numeric comparison (which is often faster
>> because it avoids locale machinery; even 'C' locale has some work involved)
>> could be used in `sort`? To my knowledge, most `sort`s out there won't
>> handle hex with numeric comparison-- only string comparison.
>> 
>> Or am I (as I often am) missing something about how those numbers get
>> used? Obviously, decimal versions of the IDs would be darn big numbers and
>> less readable, but if that is the concern, would there be any objection to
>> providing a switch on the utility to choose which radix and comparison
>> function to use?
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> 
