Hi folks,

I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on JIRA), and I've got a question about working with org.apache.nutch.io.MapFile.

I am parsing a textfile with one key/value pair per line. I want to write this into a new MapFile. MapFile.Writer requires keys to be added strictly in-order, so currently I:
- read the textfile into an ArrayList
- sort (in RAM)
- write the MapFile
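
For reference, the current approach looks roughly like the sketch below. InOrderWriter is a hypothetical stand-in for MapFile.Writer (it is not the real Nutch class, only something that rejects out-of-order keys), and the tab-separated line format is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortThenWrite {
    /** Hypothetical stand-in for MapFile.Writer: rejects keys that sort before the previous key. */
    static class InOrderWriter {
        private final List<String[]> entries = new ArrayList<>();
        private String lastKey = null;

        void append(String key, String value) {
            if (lastKey != null && key.compareTo(lastKey) < 0) {
                throw new IllegalStateException("key out of order: " + key);
            }
            entries.add(new String[] {key, value});
            lastKey = key;
        }

        List<String[]> entries() {
            return entries;
        }
    }

    /** Parses "key<TAB>value" lines (assumed format), sorts them in RAM, then writes in key order. */
    static InOrderWriter load(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                pairs.add(kv); // lines without a tab are skipped
            }
        }
        // The whole input is held in memory here, which is the scaling problem.
        Collections.sort(pairs, (a, b) -> a[0].compareTo(b[0]));

        InOrderWriter writer = new InOrderWriter();
        for (String[] kv : pairs) {
            writer.append(kv[0], kv[1]);
        }
        return writer;
    }
}
```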

Clearly this won't scale for a large textfile, so I'm changing it to use a temporary SequenceFile instead. Then I'll sort the SequenceFile and copy it item-by-item into the MapFile.

While I'm doing this, I'm wondering if there isn't a way to avoid the 2nd copy.

Is there a way to create a MapFile from an already-sorted SequenceFile? Or, create an unsorted "data" file, sort it, then add the "index"? I didn't see anything in MapFile.* to permit this, but I'm probably missing something.
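
To make the second idea concrete, here is a rough sketch of what "adding the index afterwards" might look like, assuming the index is nothing more than a sparse list of (key, position) entries taken from the already-sorted data. Every name here is hypothetical and none of it comes from the Nutch API. The position is a plain record number, where a real index would more likely store an offset into the data file.

```java
import java.util.ArrayList;
import java.util.List;

public class SparseIndexSketch {
    /** One index entry: a key and the record number where it appears in the sorted data. */
    static class IndexEntry {
        final String key;
        final long position;

        IndexEntry(String key, long position) {
            this.key = key;
            this.position = position;
        }
    }

    /**
     * Scans keys that are already sorted and records every interval-th one.
     * Throws if the input turns out not to be sorted.
     */
    static List<IndexEntry> buildIndex(List<String> sortedKeys, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<IndexEntry> index = new ArrayList<>();
        String previous = null;
        for (int i = 0; i < sortedKeys.size(); i++) {
            String key = sortedKeys.get(i);
            if (previous != null && key.compareTo(previous) < 0) {
                throw new IllegalStateException("data is not sorted at record " + i);
            }
            if (i % interval == 0) {
                index.add(new IndexEntry(key, i));
            }
            previous = key;
        }
        return index;
    }
}
```

With this, the sorted data would only need to be read once to build the index, rather than copied again record by record.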

--Matt

--
Matt Kangas / [EMAIL PROTECTED]

