Hi Chris,

There is absolutely no doubt that this would be of use, exactly for the issues
Markus highlights.
I wonder if it is worth adding general options similar to those offered by
readseg [1]. That would make it possible to ignore certain directories within a
segment directory, reducing the overhead of the SegmentContentDumper tool and
possibly producing a more accurate content dump. Does this make any sense?

[1] http://wiki.apache.org/nutch/bin/nutch_readseg
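
For reference, readseg's dump mode already lets you skip individual segment
subdirectories; for example

  ./bin/nutch readseg -dump crawl/segments/20111127104947 dump_out \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

dumps only the content/ part of that segment. Analogous (hypothetical) flags
on SegmentContentDumper, something like -noParseData or -noParseText, could
let it skip whatever parts of a segment a given dump doesn't need. Just a
sketch of the idea.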

On Tue, Nov 29, 2011 at 8:01 AM, Markus Jelsma
<[email protected]>wrote:

> Sounds useful indeed! Especially with the regex pattern. Reading files
> from the fs is a lot faster than using segread all the time.
>
>
> CTO - Openindex.io
> "Mattmann, Chris A (388J)" <[email protected]> schreef:
>
> OK, of course, I figured it out, and updated my program :-)
>
> You can see it on Github below. I'm going to clean up and
> generalize this program because I think it's of general use.
> I'll create an issue shortly.
>
> I'm thinking the tool could be something like:
>
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>   -segmentRootDir   full file path to the root segment directory, e.g.
>                     crawl/segments
>   -regexUrlPattern  a regex URL pattern used to select the URL keys to dump
>                     from the content DB in each segment
>   -outputDir        the output directory to write the dumped files to
>   -metadata         --key=value, where key is a Content metadata key and
>                     value is a value to check; if the URL's content metadata
>                     has a matching key/value pair, dump it (allow regex
>                     matching on the value)
>
> This would let users unravel the content hidden in the segment directories'
> sequence files back into the usable files that Nutch downloaded.
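>
> For the vault.fbi.gov case below, an invocation might look roughly like this
> (just a sketch against the options above; the Content-Type check assumes the
> type is kept in the content metadata):
>
>   ./bin/nutch org.apache.nutch.tools.SegmentContentDumper \
>     -segmentRootDir crawl/segments \
>     -regexUrlPattern ".*at_download/file.*" \
>     -outputDir pdf_dump \
>     -metadata --Content-Type=application/pdf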
>
> Do you guys see this as a useful tool? If so, I'll contribute it this week
> for 1.5.
>
> Cheers,
> Chris
>
> On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:
>
> > Hey Guys,
> >
> > One more thing. Just to let you know I've followed this blog here:
> >
> >
> > http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/
> >
> > And started to write a simple program to read the keys in a
> > Segment file, and then dump out the byte content if the key
> > matches the desired URL. You can find my code here:
> >
> >
> > https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java
> >
> > Unfortunately, this code keeps dying due to OOM issues,
> > clearly because the data file is too big, and because
> > I likely have to M/R this.
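> >
> > For reference, a minimal streaming read of a segment's content data looks
> > roughly like the sketch below. It processes one record at a time, so memory
> > stays flat regardless of segment size; the local-mode layout
> > <segment>/content/part-00000/data, the class name, and the placeholder
> > output naming are all assumptions here, not the PDFDumper code itself:
> >
> > import java.io.FileOutputStream;
> > import java.util.regex.Pattern;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.util.ReflectionUtils;
> > import org.apache.nutch.protocol.Content;
> >
> > public class ContentStreamDumper {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = new Configuration();
> >     FileSystem fs = FileSystem.get(conf);
> >     // args: <segmentDir> <urlRegex> <outputDir>
> >     Path data = new Path(args[0], "content/part-00000/data");
> >     Pattern urlPattern = Pattern.compile(args[1]); // e.g. ".*at_download/file.*"
> >     String outDir = args[2];
> >
> >     SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
> >     Text key = new Text();
> >     Content value = (Content) ReflectionUtils.newInstance(
> >         reader.getValueClass(), conf);
> >     int n = 0;
> >     // next() reuses key/value, so only one record is held in memory at a time
> >     while (reader.next(key, value)) {
> >       if (urlPattern.matcher(key.toString()).matches()) {
> >         // placeholder naming; the real tool would use the anchor text
> >         FileOutputStream out =
> >             new FileOutputStream(outDir + "/file-" + (n++) + ".pdf");
> >         out.write(value.getContent());
> >         out.close();
> >       }
> >     }
> >     reader.close();
> >   }
> > }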
> >
> > Just wanted to let you guys know where I'm at, and what
> > I've been trying.
> >
> > Thanks,
> > Chris
> >
> > On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:
> >
> >> Hey Guys,
> >>
> >> So, I've completed my crawl of the vault.fbi.gov website for my class
> >> that I'm preparing for. I've got:
> >>
> >> [chipotle:local/nutch/framework] mattmann% du -hs crawl
> >> 28G crawl
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
> >> total 0
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
> >> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> ./bin/nutch readseg -list -dir crawl/segments/
> >> NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
> >> 20111127104947 1 2011-11-27T10:49:50 2011-11-27T10:49:50 1 1
> >> 20111127104955 31 2011-11-27T10:49:57 2011-11-27T10:49:58 31 31
> >> 20111127105006 4898 2011-11-27T10:50:08 2011-11-27T10:51:40 4898 4890
> >> 20111127105251 9890 2011-11-27T10:52:52 2011-11-27T11:56:06 714 713
> >> 20111127125721 9202 2011-11-27T12:57:24 2011-11-27T14:00:17 971 686
> >> 20111127144648 8261 2011-11-27T14:46:50 2011-11-27T15:48:25 714 712
> >> 20111127164220 7575 2011-11-27T16:42:22 2011-11-27T17:45:50 720 718
> >> 20111127184345 6871 2011-11-27T18:43:48 2011-11-27T19:47:11 767 766
> >> 20111127204447 6116 2011-11-27T20:44:50 2011-11-27T21:48:07 725 724
> >> 20111127224816 5406 2011-11-27T22:48:18 2011-11-27T23:51:33 744 744
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> So the reality is, after crawling vault.fbi.gov, all I really want is the
> >> extracted PDF files that are housed in those segments. I've been playing
> >> around with ./bin/nutch readseg, and all I can say based on my initial
> >> impressions is that it's really hard to get it to fulfill these simple
> >> requirements:
> >>
> >> 1. Iterate over all the segments
> >> - pull out URLs that have at_download/file in them
> >> - for each one of those URLs, get its anchor, aka somefile.pdf (the anchor
> >>   is the readable PDF name; the actual URL is a Plone CMS URL with little
> >>   meaning; see the linkdb question after this list)
> >>
> >> 2. for each PDF file anchor name
> >>  - create a file in output_dir with the PDF file data read from the
> >>    segment
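> >>
> >> (Side question on the anchor step: assuming my crawl dir also has a linkdb
> >> from ./bin/nutch invertlinks, would reading the inlink anchors for a given
> >> URL, e.g.
> >>
> >>   ./bin/nutch readlinkdb crawl/linkdb -url <url-of-the-pdf>
> >>
> >> be a reasonable way to map a Plone URL back to its somefile.pdf name? The
> >> <url-of-the-pdf> here is just a placeholder.)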
> >>
> >> My guess is that even at the scale of data I'm dealing with (10s of GB),
> >> it's impossible, or at least impractical, to do anything that's not M/R
> >> here. Unfortunately there isn't a tool that will simply grab me the PDF
> >> files out of the segment files and then output them into a directory,
> >> appropriately named with the anchor text. Or...is there? ;-)
> >>
> >> I'm running in local mode, with no Hadoop cluster behind me, on a MacBook
> >> Pro (4 cores, 2.8 GHz, 8 GB RAM), intentionally, as I don't want it to be
> >> a requirement for folks to have a cluster to do this assignment that I'm
> >> working on.
> >>
> >> I was talking to Ken Krugler about this, and after picking his brain, I
> >> think that I'm going to have to end up writing a tool to do what I want.
> >> So, if that's the case, fine, but can someone point me in the right
> >> direction for a good starting point for this? Ken also thought Andrzej
> >> might have like 10 magic solutions to make this happen, so here's hoping
> >> he's out there listening :-)
> >>
> >> Thanks for the help, guys.
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: [email protected]
> > WWW:   http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
*Lewis*
