I've been tinkering with tdbloader2 and optimizing a use of it, and now I've 
got a question about the intermediate files created (data-quads.tmp and 
data-triples.tmp).

They contain hexadecimal numbers that represent node IDs in tuples; that is, 
the contents of the database to be built, in tabular form. The TDB node IDs 
are 64-bit integers, if I remember correctly, and as I say, they are 
represented in these files as long hex strings. These data files are sorted 
before being packed into indexes, and that sort is done with plain old POSIX 
`sort`.

If `sort` is the tool to be used (or at least the default, since it can be 
aliased out if appropriate) wouldn't it make more sense for those IDs to be 
decimal-radix integers, so that numeric comparison (which is often faster 
because it avoids locale machinery; even 'C' locale has some work involved) 
could be used in `sort`? To my knowledge, most `sort` implementations out 
there won't handle hex with numeric comparison (`sort -n`), only string 
comparison.
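
To make the trade-off concrete, here is a small Python sketch. It is not taken 
from tdbloader2; the 16-digit, zero-padded, uppercase hex layout and the names 
`to_hex` and `sort_orders_agree` are my assumptions for illustration only. It 
shows that if the hex strings are fixed-width, plain string comparison already 
yields numeric order, whereas unpadded decimal strings would need a numeric 
comparison to sort correctly.

```python
import random


def to_hex(node_id: int) -> str:
    """Format a 64-bit ID as 16 zero-padded uppercase hex digits (assumed layout)."""
    return f"{node_id:016X}"


def sort_orders_agree(ids) -> bool:
    """True if sorting by the hex string gives the same order as sorting by value."""
    by_string = sorted(ids, key=to_hex)  # what a byte-wise string sort would do
    by_number = sorted(ids)              # what a numeric sort would do
    return by_string == by_number


if __name__ == "__main__":
    random.seed(0)
    ids = [random.getrandbits(64) for _ in range(1000)]
    print("fixed-width hex, string order == numeric order:", sort_orders_agree(ids))

    # Unpadded decimal strings do not sort numerically under string comparison.
    print(sorted(["9", "10"]))  # ['10', '9']
    print(sorted([9, 10]))      # [9, 10]
```

If the real files are not fixed-width, the first result would not hold and the 
case for a numeric (or padded) representation would be stronger.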

Or am I (as I often am) missing something about how those numbers get used? 
Obviously, decimal versions of the IDs would be darn big numbers and less 
readable, but if that is the concern, would there be any objection to providing 
a switch on the utility to choose which radix and comparison function to use?

---
A. Soroka
The University of Virginia Library
