Hey Guys, one more thing. Just to let you know, I've followed this blog here:
http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/

and started to write a simple program to read the keys in a segment file,
and then dump out the byte content if the key matches the desired URL. You
can find my code here:

https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java

Unfortunately, this code keeps dying due to OOM issues, clearly because the
data file is too big, and because I likely have to M/R this. Just wanted to
let you guys know where I'm at, and what I've been trying.
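In case it helps to see what I mean, the core of what I'm attempting boils
down to the sketch below. This is a stripped-down illustration, not the
actual PDFDumper code: it assumes the Nutch 1.x segment layout, where
content/part-00000/data is a SequenceFile of (Text url, Content) records,
and part-00000 is the single part file a local-mode run produces. The
class name is made up.

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentDumpSketch {

  public static void main(String[] args) throws IOException {
    // args[0]: segment data file, e.g.
    //          crawl/segments/20111127104947/content/part-00000/data
    // args[1]: the URL whose fetched bytes we want
    // args[2]: output file, e.g. somefile.pdf
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    Content content = new Content();
    try {
      // next() deserializes one (url, Content) record per call, so heap
      // use is bounded by the largest record, not by the file size.
      while (reader.next(url, content)) {
        if (url.toString().equals(args[1])) {
          FileOutputStream out = new FileOutputStream(args[2]);
          try {
            out.write(content.getContent());
          } finally {
            out.close();
          }
          break;
        }
      }
    } finally {
      reader.close();
    }
  }
}

In principle a record-at-a-time read like this never holds more than one
deserialized record in memory, which is the property I'm after.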
Thanks,
Chris

On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:

> Hey Guys,
>
> So, I've completed my crawl of the vault.fbi.gov website for the class
> that I'm preparing for. I've got:
>
> [chipotle:local/nutch/framework] mattmann% du -hs crawl
>  28G    crawl
> [chipotle:local/nutch/framework] mattmann%
>
> [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
> total 0
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
> drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
> [chipotle:local/nutch/framework] mattmann%
>
> ./bin/nutch readseg -list -dir crawl/segments/
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
> 20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
> 20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
> 20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
> 20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
> 20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
> 20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
> 20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
> 20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
> 20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
> [chipotle:local/nutch/framework] mattmann%
>
> So the reality is, after crawling vault.fbi.gov, all I really want are
> the extracted PDF files that are housed in those segments. I've been
> playing around with ./bin/nutch readseg, and all I can say, based on my
> initial impressions, is that it's really hard to get it to fulfill these
> simple requirements:
>
> 1. Iterate over all the segments
>    - pull out URLs that have at_download/file in them
>    - for each of those URLs, get its anchor, aka somefile.pdf (the
>      anchor is the readable PDF name; the actual URL is a Plone CMS URL
>      with little meaning)
>
> 2. For each PDF file anchor name
>    - create a file in output_dir with the PDF file data read from the
>      segment
>
> My guess is that even at the scale of data I'm dealing with (10s of GB),
> it's impossible and impractical to do anything that's not M/R here.
> Unfortunately there isn't a tool that will simply grab me the PDF files
> out of the segment files and then output those into a directory,
> appropriately named with the anchor text. Or... is there? ;-)
>
> I'm running in Local mode, with no Hadoop cluster behind me, and with a
> MacBook Pro, 4 cores, 2.8 GHz, and 8 GB RAM to get this working. That's
> intentional, as I don't want a cluster to be a requirement for folks
> doing the assignment I'm working on.
>
> I was talking to Ken Krugler about this, and after picking his brain, I
> think I'm going to end up having to write a tool to do what I want. If
> that's the case, fine, but can someone point me in the right direction
> for a good starting point? Ken also thought Andrzej might have like 10
> magic solutions to make this happen, so here's hoping he's out there
> listening :-)
>
> Thanks for the help, guys.
>
> Cheers,
> Chris
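P.S. For concreteness, here's the direction I'm imagining for a tool that
covers both requirements in the message above, one segment at a time.
Again, just a sketch under assumptions: part-00000 is the single
local-mode part file, the anchors come from each page's outlinks in
parse_data, the class name is made up, and a real tool would have to
sanitize anchor text before using it as a filename.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

public class PdfExtractorSketch {

  public static void main(String[] args) throws IOException {
    Path segment = new Path(args[0]); // e.g. crawl/segments/20111127104947
    String outputDir = args[1];       // directory for the dumped PDFs

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Text url = new Text();

    // Pass 1 (requirement 1): scan parse_data for outlinks containing
    // at_download/file, and remember the anchor text for each target URL.
    Map<String, String> anchors = new HashMap<String, String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(
        fs, new Path(segment, "parse_data/part-00000/data"), conf);
    ParseData parseData = new ParseData();
    while (reader.next(url, parseData)) {
      for (Outlink link : parseData.getOutlinks()) {
        if (link.getToUrl().contains("at_download/file")) {
          anchors.put(link.getToUrl(), link.getAnchor());
        }
      }
    }
    reader.close();

    // Pass 2 (requirement 2): stream the fetched content and write each
    // matching record to outputDir under its anchor name.
    reader = new SequenceFile.Reader(
        fs, new Path(segment, "content/part-00000/data"), conf);
    Content content = new Content();
    while (reader.next(url, content)) {
      String anchor = anchors.get(url.toString());
      if (anchor != null && anchor.length() > 0) {
        FileOutputStream out = new FileOutputStream(outputDir + "/" + anchor);
        try {
          out.write(content.getContent());
        } finally {
          out.close();
        }
      }
    }
    reader.close();
  }
}

One wrinkle: the page that links to a PDF and the PDF itself will usually
land in different segments, so pass 1 really needs to run over every
segment's parse_data before pass 2 starts. Each pass is a straight scan
over one SequenceFile, though, so if this does end up needing M/R, each
pass should map naturally onto a map-only job over the corresponding
segment subdirectory.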
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
