Hey Guys, one more thing. Just to let you know, I've followed this blog here:
http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/

and started to write a simple program to read the keys in a segment file,
and then dump out the byte content if the key matches the desired URL. You
can find my code here:

https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java

Unfortunately, this code keeps dying due to OOM issues, clearly because the
data file is too big, and because I likely have to M/R this. Just wanted to
let you guys know where I'm at, and what I've been trying.
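In case it helps to see what I mean, the core of what I'm attempting boils
down to the sketch below. This is a stripped-down illustration, not the
actual PDFDumper code: it assumes the Nutch 1.x segment layout, where
content/part-00000/data is a SequenceFile of (Text url, Content) records,
and part-00000 is the single part file a local-mode run produces. The
class name is made up.

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentDumpSketch {

  public static void main(String[] args) throws IOException {
    // args[0]: segment data file, e.g.
    //          crawl/segments/20111127104947/content/part-00000/data
    // args[1]: the URL whose fetched bytes we want
    // args[2]: output file, e.g. somefile.pdf
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    Content content = new Content();
    try {
      // next() deserializes one (url, Content) record per call, so heap
      // use is bounded by the largest record, not by the file size.
      while (reader.next(url, content)) {
        if (url.toString().equals(args[1])) {
          FileOutputStream out = new FileOutputStream(args[2]);
          try {
            out.write(content.getContent());
          } finally {
            out.close();
          }
          break;
        }
      }
    } finally {
      reader.close();
    }
  }
}

In principle a record-at-a-time read like this never holds more than one
deserialized record in memory, which is the property I'm after.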
Thanks,
Chris

On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:

> Hey Guys,
>
> So, I've completed my crawl of the vault.fbi.gov website for the class
> that I'm preparing for. I've got:
>
> [chipotle:local/nutch/framework] mattmann% du -hs crawl
>  28G    crawl
> [chipotle:local/nutch/framework] mattmann%
>
> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
> total 0
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
> [chipotle:local/nutch/framework] mattmann%
>
> ./bin/nutch readseg -list -dir crawl/segments/
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
> 20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
> 20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
> 20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
> 20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
> 20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
> 20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
> 20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
> 20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
> 20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
> [chipotle:local/nutch/framework] mattmann%
>
> So the reality is, after crawling vault.fbi.gov, all I really want are
> the extracted PDF files that are housed in those segments. I've been
> playing around with ./bin/nutch readseg, and all I can say, based on my
> initial impressions, is that it's really hard to get it to fulfill these
> simple requirements:
>
> 1. Iterate over all the segments
>    - pull out URLs that have at_download/file in them
>    - for each of those URLs, get its anchor, aka somefile.pdf (the
>      anchor is the readable PDF name; the actual URL is a Plone CMS URL
>      with little meaning)
>
> 2. For each PDF file anchor name
>    - create a file in output_dir with the PDF file data read from the
>      segment
>
> My guess is that even at the scale of data I'm dealing with (10s of GB),
> it's impossible and impractical to do anything that's not M/R here.
> Unfortunately there isn't a tool that will simply grab me the PDF files
> out of the segment files and then output those into a directory,
> appropriately named with the anchor text. Or... is there? ;-)
>
> I'm running in Local mode, with no Hadoop cluster behind me, and with a
> MacBook Pro, 4 cores, 2.8 GHz, and 8 GB RAM to get this working. That's
> intentional, as I don't want a cluster to be a requirement for folks
> doing the assignment I'm working on.
>
> I was talking to Ken Krugler about this, and after picking his brain, I
> think I'm going to end up having to write a tool to do what I want. If
> that's the case, fine, but can someone point me in the right direction
> for a good starting point? Ken also thought Andrzej might have like 10
> magic solutions to make this happen, so here's hoping he's out there
> listening :-)
>
> Thanks for the help, guys.
>
> Cheers,
> Chris
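P.S. For concreteness, here's the direction I'm imagining for a tool that
covers both requirements in the message above, one segment at a time.
Again, just a sketch under assumptions: part-00000 is the single
local-mode part file, the anchors come from each page's outlinks in
parse_data, the class name is made up, and a real tool would have to
sanitize anchor text before using it as a filename.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

public class PdfExtractorSketch {

  public static void main(String[] args) throws IOException {
    Path segment = new Path(args[0]); // e.g. crawl/segments/20111127104947
    String outputDir = args[1];       // directory for the dumped PDFs

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Text url = new Text();

    // Pass 1 (requirement 1): scan parse_data for outlinks containing
    // at_download/file, and remember the anchor text for each target URL.
    Map<String, String> anchors = new HashMap<String, String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(
        fs, new Path(segment, "parse_data/part-00000/data"), conf);
    ParseData parseData = new ParseData();
    while (reader.next(url, parseData)) {
      for (Outlink link : parseData.getOutlinks()) {
        if (link.getToUrl().contains("at_download/file")) {
          anchors.put(link.getToUrl(), link.getAnchor());
        }
      }
    }
    reader.close();

    // Pass 2 (requirement 2): stream the fetched content and write each
    // matching record to outputDir under its anchor name.
    reader = new SequenceFile.Reader(
        fs, new Path(segment, "content/part-00000/data"), conf);
    Content content = new Content();
    while (reader.next(url, content)) {
      String anchor = anchors.get(url.toString());
      if (anchor != null && anchor.length() > 0) {
        FileOutputStream out = new FileOutputStream(outputDir + "/" + anchor);
        try {
          out.write(content.getContent());
        } finally {
          out.close();
        }
      }
    }
    reader.close();
  }
}

One wrinkle: the page that links to a PDF and the PDF itself will usually
land in different segments, so pass 1 really needs to run over every
segment's parse_data before pass 2 starts. Each pass is a straight scan
over one SequenceFile, though, so if this does end up needing M/R, each
pass should map naturally onto a map-only job over the corresponding
segment subdirectory.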
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
