Isn't numerical sorting a fair bit slower if you need to convert from decimal to binary representation first? The algorithm for this is quite convoluted and doesn't have a fixed cost - until recently there was even a bug on some platforms where a particular string caused an infinite loop. (But that might have been for floating point :-)
Byte-by-byte comparison without Unicode should be fairly fast, but it's worth checking whether "sort" is a slowdown (I didn't think it was the slowest bit of tdbloader2).

On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:

> I've been tinkering with tdbloader2 and optimizing a use of it, and now
> I've got a question about the intermediate files created (data-quads.tmp
> and data-triples.tmp).
>
> They contain hexadecimal numbers that represent node IDs in tuples; the
> contents of the database to be built in tabular form. The TDB node IDs are
> 64 bit integers, if I remember correctly, and as I say, they are
> represented in these files as long hex strings. These data files are sorted
> before being packed into indexes, and that sort occurs by using plain old
> POSIX `sort`.
>
> If `sort` is the tool to be used (or at least the default, since it can be
> aliased out if appropriate) wouldn't it make more sense for those IDs to be
> decimal-radix integers, so that numeric comparison (which is often faster
> because it avoids locale machinery; even 'C' locale has some work involved)
> could be used in `sort`? To my knowledge, most `sort`s out there won't
> handle hex with numeric comparison-- only string comparison.
>
> Or am I (as I often am) missing something about how those numbers get
> used? Obviously, decimal versions of the IDs would be darn big numbers and
> less readable, but if that is the concern, would there be any objection to
> providing a switch on the utility to choose which radix and comparison
> function to use?
>
> ---
> A. Soroka
> The University of Virginia Library
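
A rough sketch of the byte-by-byte point, in case it helps. It assumes the hex IDs are zero-padded to a fixed width of 16 digits and use one consistent letter case; neither is confirmed in this thread, and `to_hex` is a made-up helper, not anything from tdbloader2. Under those assumptions, plain string ordering already matches numeric ordering, so no conversion to binary would be needed for the sort:

```python
import random


def to_hex(node_id: int) -> str:
    # Hypothetical encoding: a 64-bit ID as 16 zero-padded uppercase hex digits.
    return format(node_id, "016X")


random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]

# Sort the strings directly. For ASCII hex digits, Python's string ordering
# is the same as a byte-by-byte comparison.
by_string = sorted(to_hex(i) for i in ids)

# Sort the integers numerically, then encode.
by_number = [to_hex(i) for i in sorted(ids)]

same_order = by_string == by_number

# Without zero padding the two orderings diverge: "10" sorts before "9".
unpadded = sorted(format(i, "X") for i in (0x9, 0x10))
```

If the IDs in the .tmp files turn out to be variable-width or mixed-case, this equivalence would not hold and a string sort would give a different order from a numeric one.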
