Hi folks,
I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on
JIRA), and I've got a question about working with
org.apache.nutch.io.MapFile.
I am parsing a textfile with one key/value pair per line. I want to
write this into a new MapFile. MapFile.Writer requires keys to be
added strictly in-order, so currently I:
- read the textfile into an ArrayList
- sort (in RAM)
- write the MapFile
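
Roughly, the current approach looks like the sketch below. It is only an
illustration: SortedWriter and CheckingWriter are hypothetical stand-ins for
MapFile.Writer (so the example runs without Nutch on the classpath), and the
tab-separated line format is an assumption, not the real input format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class InMemorySortSketch {

    /** Hypothetical stand-in for MapFile.Writer: keys must arrive strictly in order. */
    interface SortedWriter {
        void append(String key, String value) throws IOException;
    }

    /** Test writer that enforces the ordering rule and records what it receives. */
    static class CheckingWriter implements SortedWriter {
        final List<String> written = new ArrayList<>();
        private String lastKey = null;

        @Override
        public void append(String key, String value) throws IOException {
            // Reject any key that is not strictly greater than the previous one.
            if (lastKey != null && key.compareTo(lastKey) <= 0) {
                throw new IOException("key out of order: " + key);
            }
            lastKey = key;
            written.add(key + "=" + value);
        }
    }

    /** Read every pair into RAM, sort by key, then append in order. */
    static void load(BufferedReader in, SortedWriter out) throws IOException {
        List<String[]> pairs = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Assumed format: key<TAB>value
            String[] kv = line.split("\t", 2);
            if (kv.length != 2) {
                throw new IOException("malformed line: " + line);
            }
            pairs.add(kv);
        }
        // The whole file is held in memory here, which is the scaling problem.
        pairs.sort((a, b) -> a[0].compareTo(b[0]));
        for (String[] kv : pairs) {
            out.append(kv[0], kv[1]);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "nutch.org\tallow\napache.org\tallow\nexample.com\tdeny\n";
        CheckingWriter writer = new CheckingWriter();
        load(new BufferedReader(new StringReader(text)), writer);
        System.out.println(writer.written);
        // prints [apache.org=allow, example.com=deny, nutch.org=allow]
    }
}
```

Note that with a strict ordering rule, a duplicate key in the input makes
the writer throw, so duplicates would have to be dropped or merged before
the write step.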
Clearly this won't scale to a large text file, so I'm changing it to
use a temporary SequenceFile instead. Then I'll sort the
SequenceFile and copy it item-by-item into the MapFile.
While I'm doing this, I'm wondering whether there is a way to avoid that
second copy.
Is there a way to create a MapFile from an already-sorted
SequenceFile? Or, create an unsorted "data" file, sort it, then add
the "index"? I didn't see anything in MapFile.* to permit this, but
I'm probably missing something.
--Matt
--
Matt Kangas / [EMAIL PROTECTED]