Thanks for the quick feedback! I'll use the existing facilities to finish NUTCH-87 for now. There's a good chance that I'll need to do more stuff like this soon, though, and if so, I'll consider patching MapFile.

--Matt

On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote:

Matt Kangas wrote:
Clearly this won't scale for a large textfile, so I'm changing it to use a temporary SequenceFile instead. Then I'll sort the SequenceFile, and copy item-by-item into the MapFile. While I'm doing this, I'm wondering if there isn't a way to avoid the 2nd copy.

No, not presently. So the cost of sorting becomes n*(log(n)+1), which is to say, the 2nd copy will slow things, but not hugely.

Is there a way to create a MapFile from an already-sorted SequenceFile?

No, but it wouldn't be too hard to add one.

Or, create an unsorted "data" file, sort it, then add the "index"?

Right, that's the way I'd implement sorted-SequenceFile -> MapFile. So MapFile might get a new public static method that:
  1. Moves a sorted SequenceFile to <File>/data; and
  2. Creates an index file in <File>/index.

Note that this would still have to read the entire SequenceFile, so all that's saved is re-writing it.
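
The two steps above can be sketched in plain Java as a toy model. This is not the MapFile API: the class SortedToMapSketch, its methods, the record format (writeUTF pairs), and the INDEX_INTERVAL value are all hypothetical and chosen only for illustration. It shows the shape of the idea: move the already-sorted file into place as "data", then make one sequential pass over it to write a sparse "index" of keys and byte offsets.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

// Toy model only: none of these names belong to the real MapFile or SequenceFile classes.
final class SortedToMapSketch {
    // Assumption for illustration: index every 2nd record.
    static final int INDEX_INTERVAL = 2;

    /** Writes (key, value) records in iteration order; stands in for a sorted SequenceFile. */
    static void writeRecords(Path file, Map<String, String> records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (Map.Entry<String, String> e : records.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
    }

    /** Step 1: move the sorted file to dir/data. Step 2: read it once and write dir/index. */
    static void fromSorted(Path sorted, Path dir) throws IOException {
        Files.createDirectories(dir);
        Path data = dir.resolve("data");
        Files.move(sorted, data, StandardCopyOption.REPLACE_EXISTING); // no rewrite of the records

        try (RandomAccessFile in = new RandomAccessFile(data.toFile(), "r");
             DataOutputStream index =
                 new DataOutputStream(Files.newOutputStream(dir.resolve("index")))) {
            String previous = null;
            long count = 0;
            long length = in.length();
            while (in.getFilePointer() < length) {
                long offset = in.getFilePointer(); // where this record starts
                String key = in.readUTF();
                in.readUTF(); // the value is read only to advance past it
                if (previous != null && previous.compareTo(key) > 0) {
                    throw new IOException("input is not sorted at key " + key);
                }
                if (count % INDEX_INTERVAL == 0) {
                    index.writeUTF(key);
                    index.writeLong(offset);
                }
                previous = key;
                count++;
            }
        }
    }

    /** Looks up a key: find the last index entry <= wanted, seek there, scan forward. */
    static String get(Path dir, String wanted) throws IOException {
        long start = -1;
        try (DataInputStream index =
                 new DataInputStream(Files.newInputStream(dir.resolve("index")))) {
            while (true) {
                String key = index.readUTF();
                long offset = index.readLong();
                if (key.compareTo(wanted) > 0) break;
                start = offset;
            }
        } catch (EOFException endOfIndex) {
            // End of index reached; start holds the last entry <= wanted, if any.
        }
        if (start < 0) return null; // wanted sorts before the first key
        try (RandomAccessFile in = new RandomAccessFile(dir.resolve("data").toFile(), "r")) {
            in.seek(start);
            long length = in.length();
            while (in.getFilePointer() < length) {
                String key = in.readUTF();
                String value = in.readUTF();
                int cmp = key.compareTo(wanted);
                if (cmp == 0) return value;
                if (cmp > 0) return null; // passed the slot where wanted would be
            }
        }
        return null;
    }
}
```

As in the note above, fromSorted still reads every record once to build the index; the only thing it avoids is writing the records a second time.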

Doug

--
Matt Kangas / [EMAIL PROTECTED]

