Hey Lewis,

Makes total sense. I'll get a patch going this week.

Cheers,
Chris

On Nov 30, 2011, at 8:05 AM, Lewis John Mcgibbney wrote:

> Hi Chris,
> 
> There is absolutely no doubt that this is of use, exactly for the issues 
> Markus highlights.
> I wonder if it is worth adding general options similar to those offered by 
> readseg [1]. That would make it possible to ignore certain directories within 
> a segment directory, reducing overhead on the SegmentContentDumper tool and 
> possibly producing a more accurate content dump. Does this make any sense?
> 
> [1] http://wiki.apache.org/nutch/bin/nutch_readseg
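> For example (flag names here are purely illustrative, in the spirit of 
> readseg -dump's -no* options), something like:
> 
>   ./bin/nutch org.apache.nutch.tools.SegmentContentDumper \
>       -segmentRootDir crawl/segments -noparsetext -noparsedata -nofetch
> 
> so that only the content directories ever get read.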
> 
> On Tue, Nov 29, 2011 at 8:01 AM, Markus Jelsma <[email protected]> 
> wrote:
> Sounds useful indeed! Especially with the regex pattern. Reading files from 
> the fs is a lot faster than using readseg all the time.
> 
> CTO - Openindex.io 
> "Mattmann, Chris A (388J)" <[email protected]> wrote:
> 
> OK, of course, I figured it out, and updated my program :-)
> 
> You can see it on GitHub below. I'm going to clean up and 
> generalize this program because I think it's of general use.
> I'll create an issue shortly. 
> 
> I'm thinking the tool could be something like:
> 
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>   -segmentRootDir   full file path to the root segment directory,
>                     e.g., crawl/segments
>   -regexUrlPattern  a regex URL pattern to select URL keys to dump from the
>                     content DB in each segment
>   -outputDir        the output directory to write the dumped files to
>   -metadata         --key=value, where key is a Content Metadata key and value
>                     is a value to check; if the URL's content metadata has a
>                     matching key/value pair, dump it (allow regex matching on
>                     the value)
> 
> This would let users unravel the content hidden in segment directories and 
> sequence files into usable copies of the files that Nutch downloaded.
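> 
> A rough sketch of the per-segment core, just so you can see the shape of it 
> (not the final tool; the part-00000 path, the CLI arg handling, and the 
> URL-based file naming below are all placeholders):
> 
>   import java.util.regex.Pattern;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>   import org.apache.nutch.protocol.Content;
>   import org.apache.nutch.util.NutchConfiguration;
> 
>   public class SegmentContentDumperSketch {
>     public static void main(String[] args) throws Exception {
>       // args: <segmentRootDir> <regexUrlPattern> <outputDir> (placeholder CLI)
>       Configuration conf = NutchConfiguration.create();
>       FileSystem fs = FileSystem.get(conf);
>       Pattern urlPattern = Pattern.compile(args[1]);
>       Path outputDir = new Path(args[2]);
> 
>       for (FileStatus segment : fs.listStatus(new Path(args[0]))) {
>         // content lives in <segment>/content/part-*/data
>         // (a single part file in local mode)
>         Path contentData = new Path(segment.getPath(),
>             new Path(Content.DIR_NAME, "part-00000/data"));
>         SequenceFile.Reader reader =
>             new SequenceFile.Reader(fs, contentData, conf);
>         Text url = new Text();
>         Content content = new Content();
>         try {
>           while (reader.next(url, content)) {
>             if (!urlPattern.matcher(url.toString()).matches()) continue;
>             // name the dumped file after a sanitized form of the URL for now
>             String name = url.toString().replaceAll("[^A-Za-z0-9._-]", "_");
>             FSDataOutputStream out = fs.create(new Path(outputDir, name));
>             out.write(content.getContent());
>             out.close();
>           }
>         } finally {
>           reader.close();
>         }
>       }
>     }
>   }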
> 
> Do you guys see this as a useful tool? If so, I'll contribute it this week 
> for 1.5.
> 
> Cheers,
> Chris
> 
> On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:
> 
> > Hey Guys,
> > 
> > One more thing. Just to let you know I've followed this blog here:
> > 
> > http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/
> > 
> > And started to write a simple program to read the keys in a 
> > Segment file, and then dump out the byte content if the key
> > matches the desired URL. You can find my code here:
> > 
> > https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java
> > 
> > Unfortunately, this code keeps dying with OOM errors, clearly because the 
> > data file is too big and I likely need to turn this into a M/R job.
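> > 
> > If I do go the M/R route, I'm picturing a map-only job over each segment's 
> > content/part-* files via SequenceFileInputFormat, something like the mapper 
> > sketched below (the class name, the "dump.output.dir" property, and the 
> > at_download filter are placeholders, not real Nutch code):
> > 
> >   import java.io.IOException;
> >   import org.apache.hadoop.fs.FSDataOutputStream;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.io.NullWritable;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.hadoop.mapred.JobConf;
> >   import org.apache.hadoop.mapred.MapReduceBase;
> >   import org.apache.hadoop.mapred.Mapper;
> >   import org.apache.hadoop.mapred.OutputCollector;
> >   import org.apache.hadoop.mapred.Reporter;
> >   import org.apache.nutch.protocol.Content;
> > 
> >   // Map-only: each record is handled one at a time, so no single JVM ever
> >   // has to hold the whole content data file in memory.
> >   public class ContentDumpMapper extends MapReduceBase
> >       implements Mapper<Text, Content, Text, NullWritable> {
> > 
> >     private FileSystem fs;
> >     private Path outDir;
> > 
> >     public void configure(JobConf job) {
> >       try {
> >         fs = FileSystem.get(job);
> >         outDir = new Path(job.get("dump.output.dir")); // placeholder prop
> >       } catch (IOException e) {
> >         throw new RuntimeException(e);
> >       }
> >     }
> > 
> >     public void map(Text url, Content content,
> >         OutputCollector<Text, NullWritable> output, Reporter reporter)
> >         throws IOException {
> >       if (!url.toString().contains("at_download/file")) return;
> >       // sanitized URL as file name; the anchor lookup would need parse_data
> >       String name = url.toString().replaceAll("[^A-Za-z0-9._-]", "_");
> >       FSDataOutputStream out = fs.create(new Path(outDir, name));
> >       out.write(content.getContent());
> >       out.close();
> >     }
> >   }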
> > 
> > Just wanted to let you guys know where I'm at, and what
> > I've been trying.
> > 
> > Thanks,
> > Chris
> > 
> > On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:
> > 
> >> Hey Guys,
> >> 
> >> So, I've completed my crawl of the vault.fbi.gov website for the class I'm 
> >> preparing. I've got:
> >> 
> >> [chipotle:local/nutch/framework] mattmann% du -hs crawl
> >> 28G        crawl
> >> [chipotle:local/nutch/framework] mattmann% 
> >> 
> >> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
> >> total 0
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
> >> [chipotle:local/nutch/framework] mattmann% 
> >> 
> >> ./bin/nutch readseg -list -dir crawl/segments/
> >> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> >> 20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
> >> 20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
> >> 20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
> >> 20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
> >> 20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
> >> 20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
> >> 20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
> >> 20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
> >> 20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
> >> 20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
> >> [chipotle:local/nutch/framework] mattmann% 
> >> 
> >> So the reality is, after crawling vault.fbi.gov, all I really want are the 
> >> PDF files housed in those segments. I've been playing around with 
> >> ./bin/nutch readseg, and my initial impression is that it's really hard to 
> >> get it to fulfill these simple requirements:
> >> 
> >> 1. Iterate over all the segments
> >>    - pull out URLs that have at_download/file in them
> >>    - for each of those URLs, get its anchor, aka somefile.pdf (the anchor is
> >>      the readable PDF name; the actual URL is a Plone CMS URL with little
> >>      meaning; see the sketch after this list for the anchor lookup)
> >> 
> >> 2. For each PDF file anchor name
> >>    - create a file in output_dir with the PDF file data read from the segment
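> >> 
> >> For the anchor lookup in step 1, I'm assuming the anchors can be pulled 
> >> back out of the outlinks recorded in each segment's parse_data; a rough 
> >> sketch of that (just a sketch, names are placeholders) would be:
> >> 
> >>   import java.io.IOException;
> >>   import java.util.HashMap;
> >>   import java.util.Map;
> >>   import org.apache.hadoop.conf.Configuration;
> >>   import org.apache.hadoop.fs.FileSystem;
> >>   import org.apache.hadoop.fs.Path;
> >>   import org.apache.hadoop.io.SequenceFile;
> >>   import org.apache.hadoop.io.Text;
> >>   import org.apache.nutch.parse.Outlink;
> >>   import org.apache.nutch.parse.ParseData;
> >> 
> >>   public class AnchorLookupSketch {
> >>     // Build a URL -> anchor map from the outlinks stored in parse_data, so
> >>     // each PDF URL can be renamed to its readable anchor (e.g. somefile.pdf).
> >>     static Map<String, String> loadAnchors(FileSystem fs, Path segment,
> >>         Configuration conf) throws IOException {
> >>       Path parseData = new Path(segment,
> >>           new Path(ParseData.DIR_NAME, "part-00000/data"));
> >>       SequenceFile.Reader reader =
> >>           new SequenceFile.Reader(fs, parseData, conf);
> >>       Text fromUrl = new Text();
> >>       ParseData data = new ParseData();
> >>       Map<String, String> anchors = new HashMap<String, String>();
> >>       try {
> >>         while (reader.next(fromUrl, data)) {
> >>           for (Outlink link : data.getOutlinks()) {
> >>             if (link.getToUrl().contains("at_download/file")
> >>                 && link.getAnchor().length() > 0) {
> >>               anchors.put(link.getToUrl(), link.getAnchor());
> >>             }
> >>           }
> >>         }
> >>       } finally {
> >>         reader.close();
> >>       }
> >>       return anchors;
> >>     }
> >>   }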
> >> 
> >> My guess is that even at the scale of data I'm dealing with (10s of GB), 
> >> it's impractical to do anything that's not M/R here. Unfortunately there 
> >> isn't a tool that will simply grab the PDF files out of the segment files 
> >> and write them into a directory, appropriately named with the anchor text. 
> >> Or...is there? ;-)
> >> 
> >> I'm running in local mode, with no Hadoop cluster behind me, on a MacBook 
> >> Pro (4 cores, 2.8 GHz, 8 GB RAM), intentionally, since I don't want a 
> >> cluster to be a requirement for folks doing the assignment I'm preparing.
> >> 
> >> I was talking to Ken Krugler about this, and after picking his brain, I 
> >> think I'm going to end up writing a tool to do what I want. If that's the 
> >> case, fine, but can someone point me in the right direction for a good 
> >> starting point? Ken also thought Andrzej might have like 10 magic solutions 
> >> to make this happen, so here's hoping he's out there listening :-)
> >> 
> >> Thanks for the help, guys.
> >> 
> >> Cheers,
> >> Chris
> >> 
> > 
> > 
> 
> 
> 
> -- 
> Lewis 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
