I haven't heard of such problems with straight integers (which is what we
have here), but I may very well just not have come across them. Indeed,
keeping a non-Unicode locale helps a great deal, and there are probably other
places where TDB loading could go faster. I'm just looking for low-hanging
fruit, and I am also honestly curious why text files with hex were chosen
(instead of, for example, some very compact format with a sort algorithm in
Java).

My (very rough, not carefully controlled) example with about 300Mt (million
triples) showed that `sort` was actually a good chunk of the index phase (as
opposed to the data
phase).  It's not obvious to me that there could be anything special about my 
data, but there might be, I suppose.
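
For anyone following along, here is a minimal Python sketch of why plain string
comparison can give the right order for these files at all. It is not taken from
TDB: it assumes the IDs are written as fixed-width, zero-padded, lowercase hex
(16 digits for a 64-bit value), and the helper names are made up for
illustration. Under that assumption, sorting the text and sorting the numbers
produce the same order, so the question is only which comparison is cheaper.

```python
import random


def to_hex(node_id: int) -> str:
    # Hypothetical encoding: 16 zero-padded lowercase hex digits per 64-bit ID.
    # This mirrors the assumption stated above, not TDB's actual writer.
    return format(node_id, "016x")


def sort_as_text(ids):
    # What a plain string sort does: order the hex strings character by character.
    return sorted(to_hex(i) for i in ids)


def sort_as_numbers(ids):
    # What a numeric sort would do: order by integer value, then encode.
    return [to_hex(i) for i in sorted(ids)]


rng = random.Random(0)  # fixed seed so the run is repeatable
ids = [rng.getrandbits(64) for _ in range(1000)]

# With fixed-width padding the two orders agree.
same_order = sort_as_text(ids) == sort_as_numbers(ids)
print(same_order)
```

If the padding were dropped (variable-width hex), the two orders would no longer
agree, which is one reason a fixed-width text format is convenient here.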

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <[email protected]> wrote:
> 
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and doesn't have a fixed cost. Until recently there was even a bug
> on some platforms where a particular string caused an infinite loop. (But
> that might have been for floating point :-)
> 
> Byte-by-byte comparison without Unicode should be fairly fast, but it is
> worth checking whether "sort" is a slowdown (I didn't think it was the
> slowest bit of tdbloader2).
> 
> On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:
> 
>> I've been tinkering with tdbloader2 and optimizing a use of it, and now
>> I've got a question about the intermediate files created (data-quads.tmp
>> and data-triples.tmp).
>> 
>> They contain hexadecimal numbers that represent node IDs in tuples: the
>> contents of the database to be built, in tabular form. The TDB node IDs are
>> 64 bit integers, if I remember correctly, and as I say, they are
>> represented in these files as long hex strings. These data files are sorted
>> before being packed into indexes, and that sort occurs by using plain old
>> POSIX `sort`.
>> 
>> If `sort` is the tool to be used (or at least the default, since it can be
>> aliased out if appropriate) wouldn't it make more sense for those IDs to be
>> decimal-radix integers, so that numeric comparison (which is often faster
>> because it avoids locale machinery; even 'C' locale has some work involved)
>> could be used in `sort`? To my knowledge, most `sort`s out there won't
>> handle hex with numeric comparison-- only string comparison.
>> 
>> Or am I (as I often am) missing something about how those numbers get
>> used? Obviously, decimal versions of the IDs would be darn big numbers and
>> less readable, but if that is the concern, would there be any objection to
>> providing a switch on the utility to choose which radix and comparison
>> function to use?
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> 
