At Alexa, we have huge amounts of data (hundreds of terabytes) on a 
network of cheap UNIX machines (somewhere around 1000 such machines).

The standard textutils distribution needs some changes to be 
maximally useful to us in this environment. I would like to describe 
some of the changes we've made to textutils with the hope of 
generating some discussion about what parts might be appropriate to 
fold back into the main distribution, as well as discussion of the 
strategies themselves.

For no good reason, we name these tools with an av_ prefix; so when I 
mention av_sort (or whatever) I mean our version of sort.

Also please insert "when appropriate", "when possible" and such 
throughout the discussion below.


1) Distributed Computing and Named Pipes
    One of our primary methodologies involves running things on a 
bunch of machines, and combining the results through named pipes. The 
two general rules that appear are:
a) read from all the files at once, rather than reading each 
completely in turn.
b) open all the files before reading from any of them.

For example, "sort -m" already does a), and it takes very little 
effort to make it do b) as well.

av_cat reads what is available from each file, producing output with 
all the right lines in it, but merged in a non-deterministic order.
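
To make the two rules concrete, here is a rough sketch of the av_cat 
behaviour described above. It is written in Python rather than C for 
brevity, it is not the actual av_cat code, and the name 
cat_interleaved is made up for illustration. It assumes every input 
line ends in a newline.

```python
import threading


def cat_interleaved(paths, out):
    """Copy whole lines from every input to `out`, in arrival order."""
    # Rule b): open every input before reading from any of them.
    # On a FIFO, open() blocks until the writing side has been opened.
    inputs = [open(p, "rb") for p in paths]
    lock = threading.Lock()

    def pump(f):
        with f:
            for line in f:      # one complete line at a time
                with lock:      # never interleave partial lines
                    out.write(line)

    # Rule a): read from all inputs at once, one thread per file.
    threads = [threading.Thread(target=pump, args=(f,)) for f in inputs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Called as cat_interleaved(list_of_fifos, sys.stdout.buffer), this 
emits every input line exactly once, but the relative order of lines 
from different inputs depends on timing.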


2) gzip
    Rather than buying 2 or 3 times as many machines, we gzip almost 
everything. The Alexa versions of textutils replace stdio with the 
zlib stdio-like interface, and thus can work on compressed or 
uncompressed files willy-nilly. (We also have a special way of 
zipping that lets you binary search (and otherwise randomly access) a 
zipped file, while still letting unmodified gunzip do the right 
thing, but that's not really on topic).
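
As an illustration of the "compressed or uncompressed" idea, here is 
a small Python sketch (not our C code, which relies on zlib's gz* 
calls; those already pass non-gzip input through unchanged). The 
helper name open_maybe_gzip is hypothetical; it simply checks for the 
two-byte gzip magic number.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Return a binary file object that yields uncompressed bytes."""
    f = open(path, "rb")
    # peek() looks at buffered bytes without consuming them, so this
    # also works on pipes, where we cannot seek back.
    if f.peek(2)[:2] == GZIP_MAGIC:
        # Note: closing the GzipFile does not close `f`; a real tool
        # would keep track of both objects.
        return gzip.GzipFile(fileobj=f)
    return f
```

A tool that opens all of its inputs through such a helper does not 
need to know in advance which files were gzipped.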


3) threads
    Both for performance and for named pipe use, many tools end up 
being threaded. av_cat and av_split have one thread per file. av_sort 
has three threads, one thread each for reading, writing and sorting. 
I'm guessing threads as part of the standard textutils is not an 
option.
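
For discussion purposes, the three-thread layout of av_sort can be 
sketched roughly as below. This is Python, not the real av_sort.c, 
and the function name, the chunk size and the emit callback are all 
assumptions made for the example. Each stage hands chunks of lines to 
the next through a bounded queue, so reading, sorting and writing can 
overlap.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def sort_pipeline(lines, chunk_size, emit):
    """Read chunks, sort each chunk, and pass it to emit(), in order."""
    to_sort = queue.Queue(maxsize=2)    # reader -> sorter
    to_write = queue.Queue(maxsize=2)   # sorter -> writer

    def reader():
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                to_sort.put(chunk)
                chunk = []
        if chunk:
            to_sort.put(chunk)
        to_sort.put(DONE)

    def sorter():
        for chunk in iter(to_sort.get, DONE):
            chunk.sort()
            to_write.put(chunk)
        to_write.put(DONE)

    def writer():
        for chunk in iter(to_write.get, DONE):
            emit(chunk)     # e.g. write a sorted run to a temp file

    threads = [threading.Thread(target=fn) for fn in (reader, sorter, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queues keep a fast reader from running arbitrarily far 
ahead of the sorter and filling memory.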

4) sort
    Sorting hundreds of gigabytes can take a while. av_sort.c is 
rather dramatically different from sort.c, even though their output 
is identical. In addition to the threads mentioned above, we allow 
merges of arbitrary arity (instead of fixed at 16), use a custom sort 
(based on qsort) for the usual non-stable case, and default to a much 
larger memory allocation, just to name a few changes.
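
The arbitrary-arity merge and the memory setting can be illustrated 
with a minimal external sort in Python. This is a sketch only, not 
what av_sort.c does internally; external_sort and 
max_lines_in_memory are hypothetical names, and the input lines are 
assumed to be newline-terminated.

```python
import heapq
import itertools
import tempfile


def external_sort(lines, max_lines_in_memory=1_000_000):
    """Yield `lines` in sorted order using sorted runs on disk."""
    runs = []
    it = iter(lines)
    while True:
        # Fill memory, sort, and spill one sorted run to a temp file.
        chunk = list(itertools.islice(it, max_lines_in_memory))
        if not chunk:
            break
        chunk.sort()
        run = tempfile.TemporaryFile("w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    try:
        # Merge all runs in one pass, however many there are,
        # instead of merging in batches of a fixed size.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()
```

A larger memory allowance means fewer runs, and a single wide merge 
means each line is written to disk once rather than once per merge 
pass. In practice the arity is still bounded by the number of files 
a process may have open.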

5) big
    Some tools can have problems with huge files; for example, the 
join patch I submitted last Thursday.

6) sorted order
    Only slightly off topic: several textutils tools operate on 
sorted files. Unfortunately, they all seem to have a different 
interface for expressing the sort order, and different capabilities 
for sorting. Thus it isn't always possible to join against what you 
have just sorted. I'm toying with a shared module that interprets 
'--k' parameters and handles the comparisons. Has anyone else seen 
this need? Has anyone else come up with a solution?
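
To give the idea some shape, here is a deliberately simplified Python 
sketch of such a shared module. Nothing here exists yet: parse_key is 
a hypothetical name, and the spec syntax handled is only a small 
subset of what sort accepts (a start field, an optional end field, 
and an optional 'n' for numeric comparison). The point is that sort, 
join and friends would all get their comparisons from the same 
function.

```python
import re

# Simplified spec: FIRST[,LAST][n], with 1-based field numbers.
KEY_RE = re.compile(r"^(\d+)(?:,(\d+))?(n?)$")


def parse_key(spec, sep=None):
    """Turn a key spec such as '2,2n' into a key function for lines."""
    m = KEY_RE.match(spec)
    if not m or int(m.group(1)) < 1:
        raise ValueError("unsupported key spec: %r" % spec)
    first = int(m.group(1))
    last = int(m.group(2)) if m.group(2) else None  # None: to end of line
    numeric = bool(m.group(3))

    def key(line):
        fields = line.rstrip("\n").split(sep)
        chosen = fields[first - 1:last]
        if numeric:
            try:
                return float(chosen[0])
            except (IndexError, ValueError):
                return 0.0      # treat a missing or non-numeric key as 0
        return " ".join(chosen)

    return key
```

With this, sorted(lines, key=parse_key("2,2n")) and a join that 
compares parse_key("2,2n")(a) against parse_key("2,2n")(b) are 
guaranteed to agree on the ordering.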


Anyway, as I said, I'm hoping for two things:
1) some spirited discussion
2) some indication as to what changes should be submitted as patches


Andy Jewell
[EMAIL PROTECTED]



_______________________________________________
Bug-textutils mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-textutils
