Hello, I've written an extension to the Internet Archive's open source "Heritrix" crawler that lets it write into HDFS in SequenceFile format. The key is the URL and the value is the HTTP response plus some additional metadata. It's quite simple to use: just drop a few jar files into the Heritrix lib/ directory and you're good to go. Here's the download page: http://www.zvents.com/labs/hdfs_writer_processor . If you're interested, give it a whirl and feel free to send me feedback.
- Doug Judd [EMAIL PROTECTED]