I haven't heard of such problems with plain integers (which is what we have here), but I may very well just not have come across them. Indeed, keeping a non-Unicode locale helps a great deal, and there are probably other places where TDB loading could go faster. I'm just looking for low-hanging fruit, and I am also honestly curious why text files with hex were chosen (instead of, for example, some very compact format with a sort algorithm in Java).
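One possible reason for the hex choice, offered only as a guess: if the IDs are written as fixed-width, zero-padded hex, a plain byte-wise string sort already yields numeric order, so no numeric comparison mode is needed. The sketch below illustrates that property. The 16-digit uppercase encoding and the `to_hex` helper are assumptions for illustration, not the confirmed format of the tdbloader2 temporary files.

```python
import random

def to_hex(node_id: int) -> str:
    # Hypothetical encoding: a 64-bit ID as 16 zero-padded uppercase hex digits.
    return format(node_id, "016X")

random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]
rows = [to_hex(i) for i in ids]

# Plain string sort; for ASCII input this matches byte-wise comparison,
# which is what `sort` does in the C locale.
by_string = sorted(rows)

# Numeric sort on the underlying integers, re-encoded for comparison.
by_number = [to_hex(i) for i in sorted(ids)]

same_order = by_string == by_number
print(same_order)
```

The property depends on every field having the same width and a consistent letter case. With variable-width hex, string order and numeric order would diverge.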
My (very rough, not carefully controlled) example with about 300Mt showed that sort was actually a good chunk of the index phase (as opposed to the data phase). It's not obvious to me that there could be anything special about my data, but there might be, I suppose.

---
A. Soroka
The University of Virginia Library

> On Oct 21, 2016, at 9:22 AM, Stian Soiland-Reyes <[email protected]> wrote:
>
> Isn't numerical sorting a fair bit slower if you need to convert from
> decimal to binary representation first? The algorithm for this is quite
> convoluted and don't have fixed costs - until recently there was even a bug
> on some platforms were a particular string caused an infinite loop. (But
> that might have been to floating points :-)
>
> Byte-by-byte comparison without unicode should be fairly fast.. but worth
> checking if "sort" is a slowdown (I didn't think it was the slowest bit of
> tdbloader2)
>
> On 20 Oct 2016 10:01 pm, "A. Soroka" <[email protected]> wrote:
>
>> I've been tinkering with tdbloader2 and optimizing a use of it, and now
>> I've got a question about the intermediate files created (data-quads.tmp
>> and data-triples.tmp).
>>
>> They contain hexadecimal numbers that represent node IDs in tuples; the
>> contents of the database to be build in tabular form. The TDB node IDs are
>> 64 bit integers, if I remember correctly, and as I say, they are
>> represented in these files as long hex strings. These data files are sorted
>> before being packed into indexes, and that sort occurs by using plain old
>> POSIX `sort`.
>>
>> If `sort` is the tool to be used (or at least the default, since it can be
>> aliased out if appropriate) wouldn't it make more sense for those IDs to be
>> decimal-radix integers, so that numeric comparison (which is often faster
>> because it avoids locale machinery; even 'C' locale has some work involved)
>> could be used in `sort`? To my knowledge, most `sort`s out there won't
>> handle hex with numeric comparison-- only string comparison.
>>
>> Or am I (as I often am) missing something about how those numbers get
>> used? Obviously, decimal versions of the IDs would be darn big numbers and
>> less readable, but if that is the concern, would there be any objection to
>> providing a switch on the utility to choose which radix and comparison
>> function to use?
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
