Isn't numerical sorting a fair bit slower if you need to convert from
decimal to binary representation first? The algorithm for this is quite
convoluted and doesn't have a fixed cost - until recently there was even a
bug on some platforms where a particular string caused an infinite loop.
(But that might have been for floating point :-)

Byte-by-byte comparison without Unicode should be fairly fast, but it is
worth checking whether "sort" is actually a slowdown (I didn't think it was
the slowest bit of tdbloader2).
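
To illustrate the byte-by-byte point, here is a rough Python sketch. It
assumes (I have not checked the tdbloader2 output format) that the IDs are
written as fixed-width, zero-padded, single-case hex; the 16-digit width and
the to_hex helper are made up for the example. Under that assumption plain
byte comparison already gives numeric order, so no radix conversion is needed
for the sort to be correct:

```python
import random

WIDTH = 16  # hex digits in a 64-bit value; fixed width is an assumption


def to_hex(node_id: int) -> str:
    """Hypothetical helper: zero-padded, lowercase, fixed-width hex."""
    return format(node_id, "0{}x".format(WIDTH))


random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]
ids += [0, 1, 15, 16, 2**64 - 1]  # a few edge cases

# Sort numerically first, then render as hex.
by_number = [to_hex(n) for n in sorted(ids)]

# Render as hex first, then sort by raw bytes (roughly what a 'C' locale
# string sort would do).
by_bytes = sorted((to_hex(n) for n in ids), key=lambda s: s.encode("ascii"))

orders_match = by_number == by_bytes
print(orders_match)

# Without padding the equivalence breaks: 16 ("10") sorts before 9 ("9").
unpadded = sorted(format(n, "x") for n in (9, 16))
print(unpadded)
```

If the files are not fixed width, or mix upper and lower case, this
equivalence does not hold and the comparison would need more care.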

On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:

> I've been tinkering with tdbloader2 and optimizing a use of it, and now
> I've got a question about the intermediate files created (data-quads.tmp
> and data-triples.tmp).
>
> They contain hexadecimal numbers that represent node IDs in tuples; the
> contents of the database to be built in tabular form. The TDB node IDs are
> 64 bit integers, if I remember correctly, and as I say, they are
> represented in these files as long hex strings. These data files are sorted
> before being packed into indexes, and that sort occurs by using plain old
> POSIX `sort`.
>
> If `sort` is the tool to be used (or at least the default, since it can be
> aliased out if appropriate) wouldn't it make more sense for those IDs to be
> decimal-radix integers, so that numeric comparison (which is often faster
> because it avoids locale machinery; even 'C' locale has some work involved)
> could be used in `sort`? To my knowledge, most `sort`s out there won't
> handle hex with numeric comparison-- only string comparison.
>
> Or am I (as I often am) missing something about how those numbers get
> used? Obviously, decimal versions of the IDs would be darn big numbers and
> less readable, but if that is the concern, would there be any objection to
> providing a switch on the utility to choose which radix and comparison
> function to use?
>
> ---
> A. Soroka
> The University of Virginia Library
>
>
