Matt Kangas wrote:
Clearly this won't scale for a large textfile, so I'm changing it to
use a temporary SequenceFile instead. Then I'll sort the SequenceFile,
and copy item-by-item into the MapFile.
While I'm doing this, I'm wondering if there isn't a way to avoid the
2nd copy.
No, not presently. So the total cost becomes n*(log(n)+1) rather than
the n*log(n) of the sort alone, which is to say, the 2nd copy will slow
things, but not hugely.
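To put a rough number on "not hugely", here is the ratio of the two costs. It assumes base-2 logarithms and that one copy pass costs about the same per record as one sort pass; both are simplifying assumptions, not measurements.

```latex
\frac{n(\log_2 n + 1)}{n \log_2 n} = 1 + \frac{1}{\log_2 n},
\qquad
n = 10^6:\; \log_2 n \approx 19.9,\; 1 + \frac{1}{19.9} \approx 1.05
```

Under those assumptions the extra copy adds roughly 5% for a million records, and the relative overhead shrinks as n grows.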
Is there a way to create a MapFile from an already-sorted SequenceFile?
No, but it wouldn't be too hard to add one.
Or, create an unsorted "data" file, sort it, then add the "index"?
Right, that's the way I'd implement sorted-SequenceFile -> MapFile. So
MapFile might get a new public static method that:
1. Moves a sorted SequenceFile to <File>/data; and
2. Creates an index file in <File>/index.
Note that this would still have to read the entire SequenceFile, so all
that's saved is re-writing it.
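The two steps above could be sketched roughly as follows. This is not Hadoop code: the class SortedFileIndexer, its method names, the writeUTF-based record layout, and the interval parameter are all made up for illustration, with plain java.io standing in for SequenceFile and MapFile. It only shows the shape of the idea: move the sorted file into place, then make one read-only pass to write a sparse index of keys and byte offsets.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not part of Hadoop and not the real MapFile format.
public class SortedFileIndexer {

    /** Writes records in the toy layout assumed here: writeUTF(key), writeUTF(value). */
    public static void writeRecords(Path file, String[][] records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (String[] record : records) {
                out.writeUTF(record[0]);
                out.writeUTF(record[1]);
            }
        }
    }

    /**
     * Moves an already-sorted file to mapDir/data, then writes mapDir/index
     * holding every interval-th key with the byte offset of its record.
     * Returns the number of index entries written.
     */
    public static int moveAndIndex(Path sortedFile, Path mapDir, int interval)
            throws IOException {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        Files.createDirectories(mapDir);
        Path data = mapDir.resolve("data");
        Path index = mapDir.resolve("index");

        // Step 1: move rather than copy, so the data is not rewritten.
        Files.move(sortedFile, data);

        // Step 2: a single read pass over the data to build the index.
        int entries = 0;
        long count = 0;
        String previous = null;
        try (RandomAccessFile in = new RandomAccessFile(data.toFile(), "r");
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(index)))) {
            long length = in.length();
            while (in.getFilePointer() < length) {
                long offset = in.getFilePointer();
                String key = in.readUTF();
                in.readUTF(); // the value is read only to advance to the next record
                if (previous != null && previous.compareTo(key) > 0) {
                    throw new IOException("input is not sorted at offset " + offset);
                }
                if (count % interval == 0) {
                    out.writeUTF(key);
                    out.writeLong(offset);
                    entries++;
                }
                previous = key;
                count++;
            }
        }
        return entries;
    }
}
```

Note that the sort-order check happens after the move, so an unsorted input is left sitting at data with a partial index; a real implementation would probably want to validate first or clean up on failure.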
Doug