OK, of course, I figured it out, and updated my program :-)

You can see it on GitHub below. I'm going to clean up and generalize this
program because I think it's of general use. I'll create an issue shortly.

I'm thinking the tool could be something like:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
  -segmentRootDir   full path to the root segment directory, e.g., crawl/segments
  -regexUrlPattern  a regex URL pattern to select URL keys to dump from the
                    content DB in each segment
  -outputDir        the output directory to write the dumped files to
  -metadata         --key=value, where key is a Content metadata key and value
                    is a value to check; if the URL's content metadata has a
                    matching key/value pair, dump it (allow regex matching on
                    the value)

This would let users unravel the content hidden in segment directories and
sequence files into usable copies of the files that Nutch downloaded.
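
Roughly, here's a local-mode sketch of what I have in mind. The class and
flags are just the proposal above, and the single part-00000 file and the
URL-based file naming are placeholder assumptions, not the final tool:

import java.io.File;
import java.io.FileOutputStream;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentDumper {
  public static void main(String[] args) throws Exception {
    Path segmentRootDir = new Path(args[0]);            // e.g., crawl/segments
    Pattern regexUrlPattern = Pattern.compile(args[1]); // e.g., .*at_download/file.*
    File outputDir = new File(args[2]);
    outputDir.mkdirs();

    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Each segment stores its fetched bytes under content/part-NNNNN/data,
    // a sequence file of <Text url, Content content> records. A local,
    // single-fetcher crawl only produces part-00000 (placeholder assumption).
    for (FileStatus segment : fs.listStatus(segmentRootDir)) {
      Path data = new Path(segment.getPath(), "content/part-00000/data");
      if (!fs.exists(data)) {
        continue;
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      Content content = new Content();
      // Stream record by record so no segment is ever held in memory whole.
      while (reader.next(url, content)) {
        if (!regexUrlPattern.matcher(url.toString()).find()) {
          continue;
        }
        // Placeholder naming: last URL path component (the real tool would
        // use the anchor text instead).
        String name = url.toString().replaceAll(".*/", "");
        FileOutputStream out = new FileOutputStream(new File(outputDir, name));
        out.write(content.getContent());
        out.close();
      }
      reader.close();
    }
  }
}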

Do you guys see this as a useful tool? If so, I'll contribute it this week
for 1.5.

Cheers,
Chris

On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:

> Hey Guys,
> 
> One more thing. Just to let you know I've followed this blog here:
> 
> http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/
> 
> And started to write a simple program to read the keys in a 
> Segment file, and then dump out the byte content if the key
> matches the desired URL. You can find my code here:
> 
> https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java
> 
> Unfortunately, this code keeps dying with OOM errors, clearly because 
> the data file is too big to handle the way I'm reading it, and because 
> I likely need to M/R this. 
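> 
> If I do end up M/R-ing it, I'm picturing a simple map-only job over the 
> content data (map-only means no sort/shuffle to blow the heap). This is 
> just a sketch assuming local mode; the hard-coded dump directory and the 
> URL filter are placeholders:
> 
> import java.io.File;
> import java.io.FileOutputStream;
> import java.io.IOException;
> 
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> import org.apache.hadoop.mapred.SequenceFileInputFormat;
> import org.apache.hadoop.mapred.lib.NullOutputFormat;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
> 
> public class ContentDumpJob {
>   private static final String DUMP_DIR = "/tmp/pdfdump"; // placeholder
> 
>   public static class DumpMapper extends MapReduceBase
>       implements Mapper<Text, Content, Text, NullWritable> {
>     public void map(Text url, Content content,
>         OutputCollector<Text, NullWritable> output, Reporter reporter)
>         throws IOException {
>       if (url.toString().contains("at_download/file")) {
>         // Side-effect write; fine in local mode where all maps share one
>         // disk. Name collisions are not handled in this sketch.
>         String name = url.toString().replaceAll(".*/", "");
>         FileOutputStream out =
>             new FileOutputStream(new File(DUMP_DIR, name));
>         out.write(content.getContent());
>         out.close();
>       }
>     }
>   }
> 
>   public static void main(String[] args) throws Exception {
>     new File(DUMP_DIR).mkdirs();
>     JobConf job = new JobConf(NutchConfiguration.create(), ContentDumpJob.class);
>     // e.g., args[0] = crawl/segments/*/content (the input format resolves
>     // each MapFile directory to its data file)
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     job.setInputFormat(SequenceFileInputFormat.class);
>     job.setMapperClass(DumpMapper.class);
>     job.setNumReduceTasks(0); // map-only
>     job.setOutputFormat(NullOutputFormat.class);
>     JobClient.runJob(job);
>   }
> }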
> 
> Just wanted to let you guys know where I'm at, and what
> I've been trying.
> 
> Thanks,
> Chris
> 
> On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:
> 
>> Hey Guys,
>> 
>> So, I've completed my crawl of the vault.fbi.gov website for the class 
>> I'm preparing. I've got:
>> 
>> [chipotle:local/nutch/framework] mattmann% du -hs crawl
>> 28G  crawl
>> [chipotle:local/nutch/framework] mattmann% 
>> 
>> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
>> total 0
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
>> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
>> [chipotle:local/nutch/framework] mattmann% 
>> 
>> ./bin/nutch readseg -list -dir crawl/segments/
>> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
>> 20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
>> 20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
>> 20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
>> 20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
>> 20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
>> 20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
>> 20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
>> 20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
>> 20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
>> 20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
>> [chipotle:local/nutch/framework] mattmann% 
>> 
>> So the reality is, after crawling vault.fbi.gov, all I really want is the 
>> extracted PDF files that are housed in those segments. I've been playing 
>> around with ./bin/nutch readseg, and my initial impression is that it's 
>> really hard to get it to fulfill these simple requirements:
>> 
>> 1. Iterate over all the segments:
>>    - pull out URLs that have at_download/file in them
>>    - for each of those URLs, get its anchor, e.g., somefile.pdf (the 
>>      anchor is the readable PDF name; the actual URL is a Plone CMS URL 
>>      with little meaning); see the anchor-lookup sketch after this list
>> 
>> 2. For each PDF file anchor name:
>>    - create a file in output_dir with the PDF data read from the segment
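>> 
>> For the anchor piece, my assumption is that the linkdb can hand me the 
>> inlink anchors for a URL. A rough sketch, assuming a crawl/linkdb exists; 
>> the .pdf-suffix heuristic is just my guess at picking a usable name:
>> 
>> import java.util.Iterator;
>> 
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.Text;
>> import org.apache.nutch.crawl.Inlink;
>> import org.apache.nutch.crawl.Inlinks;
>> import org.apache.nutch.crawl.LinkDbReader;
>> import org.apache.nutch.util.NutchConfiguration;
>> 
>> public class AnchorLookup {
>>   // Returns a readable PDF name for a URL by scanning its inlink anchors,
>>   // or null if nothing PDF-looking links to it.
>>   public static String pdfAnchorFor(String url) throws Exception {
>>     Configuration conf = NutchConfiguration.create();
>>     LinkDbReader linkDb = new LinkDbReader(conf, new Path("crawl/linkdb"));
>>     Inlinks inlinks = linkDb.getInlinks(new Text(url));
>>     if (inlinks == null) {
>>       return null;
>>     }
>>     for (Iterator<Inlink> it = inlinks.iterator(); it.hasNext();) {
>>       String anchor = it.next().getAnchor();
>>       if (anchor != null && anchor.toLowerCase().endsWith(".pdf")) {
>>         return anchor;
>>       }
>>     }
>>     return null;
>>   }
>> }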
>> 
>> My guess is that even at the scale of data I'm dealing with (10s of GB), 
>> it's impractical to do anything that's not M/R here. Unfortunately, there 
>> isn't a tool that will simply grab the PDF files out of the segment files 
>> and then output them into a directory, appropriately named with the 
>> anchor text. Or... is there? ;-)
>> 
>> I'm running in local mode, with no Hadoop cluster behind me, on a MacBook 
>> Pro (4 cores, 2.8 GHz, 8 GB RAM). That's intentional: I don't want a 
>> cluster to be a requirement for folks doing the assignment I'm working on.
>> 
>> I was talking to Ken Krugler about this, and after picking his brain, I 
>> think I'm going to end up having to write a tool to do what I want. If 
>> that's the case, fine, but can someone point me in the right direction to 
>> a good starting point for this? Ken also thought Andrzej might have like 
>> 10 magic solutions to make this happen, so here's hoping he's out there 
>> listening :-)
>> 
>> Thanks for the help, guys.
>> 
>> Cheers,
>> Chris
>> 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
