Thanks for the quick feedback! I'll use the existing facilities to
finish NUTCH-87 for now. There's a good chance that I'll need to do
more stuff like this soon, though, and if so, I'll consider patching
MapFile.
--Matt
On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote:
Matt Kangas wrote:
Clearly this won't scale for a large textfile, so I'm changing it
to use a temporary SequenceFile instead. Then I'll sort the
SequenceFile, and copy item-by-item into the MapFile.
While I'm doing this, I'm wondering if there isn't a way to avoid
the 2nd copy.
No, not presently. So the cost of sorting becomes n*(log(n)+1),
which is to say, the 2nd copy will slow things, but not hugely.
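To put a rough number on "not hugely", here is a small sketch that
compares n*log(n) with n*(log(n)+1). It assumes a base-2 logarithm
and counts abstract per-item steps rather than real I/O time; the
class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Nutch: compares the sort cost
// with the cost of sorting plus one extra sequential copy.
class SortCopyCost {

    // Per-item steps for the sort alone: n * log2(n) (base 2 is an assumption).
    static double sortCost(long n) {
        return n * (Math.log(n) / Math.log(2));
    }

    // Sort plus the second copy: n * (log2(n) + 1).
    static double sortPlusCopyCost(long n) {
        return sortCost(n) + n;
    }

    // Fraction of extra work the copy adds, which simplifies to 1 / log2(n).
    static double copyOverhead(long n) {
        return (sortPlusCopyCost(n) - sortCost(n)) / sortCost(n);
    }

    public static void main(String[] args) {
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            System.out.printf("n=%d overhead=%.1f%%%n", n, 100 * copyOverhead(n));
        }
    }
}
```

Under these assumptions the extra copy adds about 1/log2(n) of the
sort cost, so its relative weight shrinks as the file grows.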
Is there a way to create a MapFile from an already-sorted
SequenceFile?
No, but it wouldn't be too hard to add one.
Or, create an unsorted "data" file, sort it, then add the "index"?
Right, that's the way I'd implement sorted-SequenceFile ->
MapFile. So MapFile might get a new public static method that:
1. Moves a sorted SequenceFile to <File>/data; and
2. Creates an index file in <File>/index.
Note that this would still have to read the entire SequenceFile, so
all that's saved is re-writing it.
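A minimal sketch of step 2, using only the Java standard library.
This is not the MapFile API: the record layout (writeUTF key/value
pairs), the SortedDataIndexer class and the interval parameter are
all assumptions made for illustration. It shows the point made
above: the sorted data is read once, in full, but never rewritten.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Nutch.
class SortedDataIndexer {

    // One sparse-index entry: a key and the byte offset where its record starts.
    static final class IndexEntry {
        final String key;
        final long offset;

        IndexEntry(String key, long offset) {
            this.key = key;
            this.offset = offset;
        }
    }

    // Stand-in for the sorted "data" file: key/value records written back to back.
    static byte[] writeData(String[][] sortedPairs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String[] pair : sortedPairs) {
                out.writeUTF(pair[0]);
                out.writeUTF(pair[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // One read-only pass over the data: remember every interval-th key and its offset.
    static List<IndexEntry> buildIndex(byte[] data, int interval) {
        List<IndexEntry> index = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String previousKey = null;
        long recordNumber = 0;
        try {
            while (in.available() > 0) {
                long offset = data.length - in.available();
                String key = in.readUTF();
                in.readUTF(); // read past the value; only keys go in the index
                if (previousKey != null && previousKey.compareTo(key) > 0) {
                    throw new IllegalStateException("data is not sorted at key " + key);
                }
                if (recordNumber % interval == 0) {
                    index.add(new IndexEntry(key, offset));
                }
                previousKey = key;
                recordNumber++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}, {"e", "5"}};
        for (IndexEntry entry : buildIndex(writeData(pairs), 2)) {
            System.out.println(entry.key + " @ " + entry.offset);
        }
    }
}
```

A lookup would then binary-search the index for the nearest key at
or before the target and scan forward in the data from that offset.
The sort-order check is there because an index built over unsorted
data would silently give wrong lookups.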
Doug
--
Matt Kangas / [EMAIL PROTECTED]