Hey Lewis,

Makes total sense. I'll get a patch going this week.
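Roughly, here's the shape of what I have in mind for the SegmentContentDumper discussed below. This is just an untested, local-mode sketch, not the actual patch: the positional argument handling, the URL-sanitizing file names, and reading each segment's content/part-*/data files directly are placeholders for the real option parsing, anchor-based naming, and an eventual M/R version.

// Untested sketch only: class name, argument handling, and file naming are placeholders.
package org.apache.nutch.tools;

import java.io.File;
import java.io.FileOutputStream;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentDumper {

  public static void main(String[] args) throws Exception {
    // Usage: SegmentContentDumper <segmentRootDir> <regexUrlPattern> <outputDir>
    Path segmentRoot = new Path(args[0]);
    Pattern urlPattern = Pattern.compile(args[1]);
    File outputDir = new File(args[2]);
    outputDir.mkdirs();

    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Walk every segment under the root, then every content part file within it.
    for (FileStatus segment : fs.listStatus(segmentRoot)) {
      Path contentDir = new Path(segment.getPath(), Content.DIR_NAME);
      if (!fs.exists(contentDir)) {
        continue;
      }
      for (FileStatus part : fs.listStatus(contentDir)) {
        if (!part.getPath().getName().startsWith("part-")) {
          continue;
        }
        // The MapFile's data file is a SequenceFile of <Text url, Content content>.
        Path data = new Path(part.getPath(), "data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        // Stream one record at a time so a whole part file never sits in RAM.
        while (reader.next(url, content)) {
          if (!urlPattern.matcher(url.toString()).find()) {
            continue;
          }
          // Placeholder naming: sanitize the URL; anchor-based names would go here.
          String fileName = url.toString().replaceAll("[^A-Za-z0-9.]+", "_");
          FileOutputStream out = new FileOutputStream(new File(outputDir, fileName));
          out.write(content.getContent());
          out.close();
        }
        reader.close();
      }
    }
  }
}

Streaming one record at a time, rather than slurping a whole data file, should also sidestep the OOM problems I hit with the earlier PDFDumper attempt.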
Cheers,
Chris

On Nov 30, 2011, at 8:05 AM, Lewis John Mcgibbney wrote:

> Hi Chris,
>
> There is absolutely no doubt that this is of use, exactly for the issues Markus highlights.
> I wonder if it is worth adding general options similar to those offered by readseg [1]. This would make it possible to ignore certain directories within a segments directory, reducing overhead on the SegmentContentDumper tool and possibly providing a more accurate content dump. Does this make any sense?
>
> [1] http://wiki.apache.org/nutch/bin/nutch_readseg
>
> On Tue, Nov 29, 2011 at 8:01 AM, Markus Jelsma <[email protected]> wrote:
>
> Sounds useful indeed! Especially with the regex pattern. Reading files from the fs is a lot faster than using readseg all the time.
>
> CTO - Openindex.io
>
> "Mattmann, Chris A (388J)" <[email protected]> schreef:
>
> OK, of course, I figured it out, and updated my program :-)
>
> You can see it on GitHub below. I'm going to clean up and generalize this program because I think it's of general use. I'll create an issue shortly.
>
> I'm thinking the tool could be something like:
>
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>   -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
>   -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
>   -outputDir        the output directory to write the dumped files to
>   -metadata         --key=value, where key is a Content metadata key and value is a value to check; if the URL and its content metadata have a matching key/value pair, dump it (allow regex matching on the value)
>
> This would allow users to unravel the content hidden in segment directories and in sequence files into usable files that were downloaded by Nutch.
>
> Do you guys see this as a useful tool? If so, I'll contribute it this week for 1.5.
>
> Cheers,
> Chris
>
> On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:
>
> > Hey Guys,
> >
> > One more thing. Just to let you know, I've followed this blog here:
> >
> > http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/
> >
> > and started to write a simple program to read the keys in a segment file, and then dump out the byte content if the key matches the desired URL. You can find my code here:
> >
> > https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java
> >
> > Unfortunately, this code keeps dying due to OOM issues, clearly because the data file is too big, and because I likely have to M/R this.
> >
> > Just wanted to let you guys know where I'm at, and what I've been trying.
> >
> > Thanks,
> > Chris
> >
> > On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:
> >
> >> Hey Guys,
> >>
> >> So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for.
> >> I've got:
> >>
> >> [chipotle:local/nutch/framework] mattmann% du -hs crawl
> >>  28G    crawl
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
> >> total 0
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> ./bin/nutch readseg -list -dir crawl/segments/
> >> NAME            GENERATED  FETCHER START         FETCHER END           FETCHED  PARSED
> >> 20111127104947  1          2011-11-27T10:49:50   2011-11-27T10:49:50   1        1
> >> 20111127104955  31         2011-11-27T10:49:57   2011-11-27T10:49:58   31       31
> >> 20111127105006  4898       2011-11-27T10:50:08   2011-11-27T10:51:40   4898     4890
> >> 20111127105251  9890       2011-11-27T10:52:52   2011-11-27T11:56:06   714      713
> >> 20111127125721  9202       2011-11-27T12:57:24   2011-11-27T14:00:17   971      686
> >> 20111127144648  8261       2011-11-27T14:46:50   2011-11-27T15:48:25   714      712
> >> 20111127164220  7575       2011-11-27T16:42:22   2011-11-27T17:45:50   720      718
> >> 20111127184345  6871       2011-11-27T18:43:48   2011-11-27T19:47:11   767      766
> >> 20111127204447  6116       2011-11-27T20:44:50   2011-11-27T21:48:07   725      724
> >> 20111127224816  5406       2011-11-27T22:48:18   2011-11-27T23:51:33   744      744
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> So the reality is, after crawling vault.fbi.gov, all I really want are the extracted PDF files that are housed in those segments. I've been playing around with ./bin/nutch readseg, and all I can say based on my initial impressions is that it's really hard to get it to fulfill these simple requirements:
> >>
> >> 1. Iterate over all the segments
> >>    - pull out URLs that have at_download/file in them
> >>    - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor is the readable PDF name; the actual URL is a Plone CMS URL with little meaning)
> >>
> >> 2. For each PDF file anchor name
> >>    - create a file in output_dir with the PDF file data read from the segment
> >>
> >> My guess is that even at the scale of data I'm dealing with (tens of GB), it's impossible, or at least impractical, to do anything that's not M/R here. Unfortunately there isn't a tool that will simply grab the PDF files out of the segment files and then output them into a directory, appropriately named with the anchor text. Or... is there? ;-)
> >>
> >> I'm running in local mode, with no Hadoop cluster behind me, just a MacBook Pro (4 cores, 2.8 GHz, 8 GB RAM). That's intentional, as I don't want a cluster to be a requirement for folks doing this assignment I'm working on.
> >>
> >> I was talking to Ken Krugler about this, and after picking his brain, I think I'm going to have to end up writing a tool to do what I want.
> >> So, if that's the case, fine, but can someone point me in the right direction for a good starting point for this? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-)
> >>
> >> Thanks for the help, guys.
> >>
> >> Cheers,
> >> Chris
>
> --
> Lewis

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
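For the anchor-based naming discussed in the thread above (naming each dumped PDF after the anchor text that links to it, rather than its Plone URL), one rough way to collect those anchors is to scan each segment's parse_data for outlinks whose target contains at_download/file. This is an untested sketch under that assumption; the class name and the substring filter are placeholders, not an existing Nutch tool.

// Untested helper sketch: maps outlink target URLs to the anchor text that points at them.
package org.apache.nutch.tools;

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class AnchorNameResolver {

  /** Scans every segment's parse_data and returns target URL -> anchor text. */
  public static Map<String, String> anchorsFor(Path segmentRoot, String urlSubstring)
      throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Map<String, String> anchors = new HashMap<String, String>();

    for (FileStatus segment : fs.listStatus(segmentRoot)) {
      Path parseDataDir = new Path(segment.getPath(), ParseData.DIR_NAME);
      if (!fs.exists(parseDataDir)) {
        continue;
      }
      for (FileStatus part : fs.listStatus(parseDataDir)) {
        if (!part.getPath().getName().startsWith("part-")) {
          continue;
        }
        // parse_data part files hold <Text url, ParseData> records.
        Path data = new Path(part.getPath(), "data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        ParseData parseData = new ParseData();
        while (reader.next(url, parseData)) {
          for (Outlink out : parseData.getOutlinks()) {
            // e.g., urlSubstring = "at_download/file" to catch the Plone download URLs.
            if (out.getToUrl().contains(urlSubstring)) {
              anchors.put(out.getToUrl(), out.getAnchor());
            }
          }
        }
        reader.close();
      }
    }
    return anchors;
  }
}

The resulting URL-to-anchor map could then replace the sanitized-URL file naming in the dumper sketched above.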

