I've been tinkering with tdbloader2 and optimizing a use of it, and now I've got a question about the intermediate files created (data-quads.tmp and data-triples.tmp).
They contain hexadecimal numbers that represent node IDs in tuples: the contents of the database to be built, in tabular form. The TDB node IDs are 64-bit integers, if I remember correctly, and as I say, they are represented in these files as long hex strings. These data files are sorted before being packed into indexes, and that sort is done with plain old POSIX `sort`.

If `sort` is the tool to be used (or at least the default, since it can be aliased out if appropriate), wouldn't it make more sense for those IDs to be decimal-radix integers, so that numeric comparison (which is often faster because it avoids locale machinery; even the 'C' locale has some work involved) could be used in `sort`? To my knowledge, most `sort`s out there won't handle hex with numeric comparison, only with string comparison. Or am I (as I often am) missing something about how those numbers get used?

Obviously, decimal versions of the IDs would be darn big numbers and less readable, but if that is the concern, would there be any objection to providing a switch on the utility to choose which radix and comparison function to use?

---
A. Soroka
The University of Virginia Library
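
P.S. To make the comparison concrete, here is a minimal Python sketch of the two representations and the two comparison styles. The sample rows are made up, and the layout (three space-separated, zero-padded 16-digit hex IDs per line) is an assumption for illustration, not something checked against what tdbloader2 actually writes. The helper names are likewise hypothetical.

```python
# Hypothetical sample rows standing in for data-triples.tmp.
# Assumed layout: three zero-padded 16-digit hex node IDs per line.
HEX_ROWS = [
    "00000000000000ff 0000000000000002 000000000000000a",
    "0000000000000009 0000000000000010 0000000000000001",
    "00000000000000a0 0000000000000001 0000000000000003",
]


def to_decimal_row(line):
    """Rewrite one row of hex IDs as a row of decimal IDs."""
    return " ".join(str(int(field, 16)) for field in line.split())


def numeric_key(line, radix):
    """Sort key: the row as a tuple of integers parsed in the given radix."""
    return tuple(int(field, radix) for field in line.split())


decimal_rows = [to_decimal_row(row) for row in HEX_ROWS]

# Plain string comparison of the hex rows (bytewise, as in the 'C' locale).
hex_by_string = sorted(HEX_ROWS)

# Numeric comparison of the decimal rows, field by field.
dec_by_number = sorted(decimal_rows, key=lambda row: numeric_key(row, 10))

# Plain string comparison of the decimal rows, for contrast.
dec_by_string = sorted(decimal_rows)

# Do the hex/string and decimal/numeric orderings agree on this sample?
same_order = [to_decimal_row(row) for row in hex_by_string] == dec_by_number
print("hex/string order matches decimal/numeric order:", same_order)
print("decimal/string order matches decimal/numeric order:",
      dec_by_string == dec_by_number)
```

On a sample like this, the sketch lets one check whether string-sorting the hex rows and numerically sorting the decimal rows produce the same order; it says nothing about which is faster in a real `sort`, which would need timing on actual data.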
