Hey guys,

I've completed my crawl of the vault.fbi.gov website for the class that I'm preparing. Here's what I've got:
[chipotle:local/nutch/framework] mattmann% du -hs crawl
 28G    crawl
[chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
total 0
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
[chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
[chipotle:local/nutch/framework] mattmann%

The reality is that after crawling vault.fbi.gov, all I really want is the PDF files housed in those segments. I've been playing around with ./bin/nutch readseg, and my initial impression is that it's really hard to get it to fulfill these two simple requirements:

1. Iterate over all the segments and pull out the URLs that have at_download/file in them. For each of those URLs, get its anchor text, e.g. somefile.pdf (the anchor is the readable PDF name; the actual URL is a Plone CMS URL with little meaning).

2. For each PDF anchor name, create a file in output_dir containing the PDF data read from the segment.

My guess is that even at the scale of data I'm dealing with (tens of GB), it's impractical to do this with anything that's not M/R. Unfortunately there isn't a tool that will simply grab the PDF files out of the segments and write them into a directory, appropriately named with the anchor text. Or... is there? ;-)

I'm running in local mode, with no Hadoop cluster behind me, on a MacBook Pro (4 cores, 2.8 GHz, 8 GB RAM). That's intentional: I don't want a cluster to be a requirement for folks doing the assignment I'm working on.

I was talking to Ken Krugler about this, and after picking his brain I think I'm going to end up having to write a tool to do what I want. If that's the case, fine, but can someone point me in the right direction for a good starting point? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-)
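For what it's worth, here's roughly the kind of thing I'm imagining, just to make the question concrete. This is a rough, untested sketch, not something I've run: I'm assuming the segments' parse_data and content subdirs can be read directly via the part-XXXXX/data SequenceFiles, that the outlink anchors in ParseData are where the readable PDF names live, and that the anchor text is sane enough to use as a filename. The class name, output naming, and two-pass structure are just my guesses at an approach.

import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/**
 * Rough sketch: walk crawl/segments, build a URL -> anchor map from the
 * outlinks in parse_data, then dump the raw fetched bytes from content/
 * for every URL containing "at_download/file" into files named after
 * the anchor text.
 */
public class PdfExtractor {

  public static void main(String[] args) throws Exception {
    Path segmentsDir = new Path(args[0]);   // e.g. crawl/segments
    String outputDir = args[1];             // local output directory
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Pass 1: URL -> anchor text, taken from the outlinks recorded in
    // parse_data. If several pages link to the same PDF, the last anchor wins.
    Map<String, String> anchors = new HashMap<String, String>();
    for (FileStatus segment : fs.listStatus(segmentsDir)) {
      Path parseData = new Path(segment.getPath(), ParseData.DIR_NAME);
      for (FileStatus part : fs.listStatus(parseData)) {
        if (!part.isDir()) continue;        // only the part-XXXXX dirs
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(part.getPath(), "data"), conf);
        Text url = new Text();
        ParseData data = new ParseData();
        while (reader.next(url, data)) {
          for (Outlink link : data.getOutlinks()) {
            if (link.getToUrl().contains("at_download/file")
                && link.getAnchor().length() > 0) {
              anchors.put(link.getToUrl(), link.getAnchor());
            }
          }
        }
        reader.close();
      }
    }

    // Pass 2: walk content/ and write out the fetched bytes for the PDF URLs.
    for (FileStatus segment : fs.listStatus(segmentsDir)) {
      Path contentDir = new Path(segment.getPath(), Content.DIR_NAME);
      for (FileStatus part : fs.listStatus(contentDir)) {
        if (!part.isDir()) continue;
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(part.getPath(), "data"), conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          String anchor = anchors.get(url.toString());
          if (anchor != null && url.toString().contains("at_download/file")) {
            // NOTE: blindly trusts the anchor as a filename; for real use I'd
            // need to sanitize slashes/spaces and handle duplicate names.
            FileOutputStream out = new FileOutputStream(outputDir + "/" + anchor);
            out.write(content.getContent());
            out.close();
          }
        }
        reader.close();
      }
    }
  }
}

If there's already something in SegmentReader (or elsewhere) that gets me most of the way there, I'd much rather use that than write and maintain the above.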
Thanks for the help, guys.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

