Hey guys,

I've completed my crawl of the vault.fbi.gov website for the class that I'm preparing. Here's what I've got:
[chipotle:local/nutch/framework] mattmann% du -hs crawl
 28G    crawl
[chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
total 0
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
[chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
[chipotle:local/nutch/framework] mattmann%

The reality is that after crawling vault.fbi.gov, all I really want is the PDF files housed in those segments. I've been playing around with ./bin/nutch readseg, and my initial impression is that it's really hard to get it to fulfill these two simple requirements:

1. Iterate over all the segments and pull out the URLs that have at_download/file in them. For each of those URLs, get its anchor text, e.g. somefile.pdf (the anchor is the readable PDF name; the actual URL is a Plone CMS URL with little meaning).

2. For each PDF anchor name, create a file in output_dir containing the PDF data read from the segment.

My guess is that even at the scale of data I'm dealing with (tens of GB), it's impractical to do this with anything that's not M/R. Unfortunately there isn't a tool that will simply grab the PDF files out of the segments and write them into a directory, appropriately named with the anchor text. Or... is there? ;-)

I'm running in local mode, with no Hadoop cluster behind me, on a MacBook Pro (4 cores, 2.8 GHz, 8 GB RAM). That's intentional: I don't want a cluster to be a requirement for folks doing the assignment I'm working on.

I was talking to Ken Krugler about this, and after picking his brain I think I'm going to end up having to write a tool to do what I want. If that's the case, fine, but can someone point me in the right direction for a good starting point? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-)
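For what it's worth, here's roughly the kind of thing I'm imagining, just to make the question concrete. This is a rough, untested sketch, not something I've run: I'm assuming the segments' parse_data and content subdirs can be read directly via the part-XXXXX/data SequenceFiles, that the outlink anchors in ParseData are where the readable PDF names live, and that the anchor text is sane enough to use as a filename. The class name, output naming, and two-pass structure are just my guesses at an approach.

import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/**
 * Rough sketch: walk crawl/segments, build a URL -> anchor map from the
 * outlinks in parse_data, then dump the raw fetched bytes from content/
 * for every URL containing "at_download/file" into files named after
 * the anchor text.
 */
public class PdfExtractor {

  public static void main(String[] args) throws Exception {
    Path segmentsDir = new Path(args[0]);   // e.g. crawl/segments
    String outputDir = args[1];             // local output directory
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Pass 1: URL -> anchor text, taken from the outlinks recorded in
    // parse_data. If several pages link to the same PDF, the last anchor wins.
    Map<String, String> anchors = new HashMap<String, String>();
    for (FileStatus segment : fs.listStatus(segmentsDir)) {
      Path parseData = new Path(segment.getPath(), ParseData.DIR_NAME);
      for (FileStatus part : fs.listStatus(parseData)) {
        if (!part.isDir()) continue;        // only the part-XXXXX dirs
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(part.getPath(), "data"), conf);
        Text url = new Text();
        ParseData data = new ParseData();
        while (reader.next(url, data)) {
          for (Outlink link : data.getOutlinks()) {
            if (link.getToUrl().contains("at_download/file")
                && link.getAnchor().length() > 0) {
              anchors.put(link.getToUrl(), link.getAnchor());
            }
          }
        }
        reader.close();
      }
    }

    // Pass 2: walk content/ and write out the fetched bytes for the PDF URLs.
    for (FileStatus segment : fs.listStatus(segmentsDir)) {
      Path contentDir = new Path(segment.getPath(), Content.DIR_NAME);
      for (FileStatus part : fs.listStatus(contentDir)) {
        if (!part.isDir()) continue;
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(part.getPath(), "data"), conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          String anchor = anchors.get(url.toString());
          if (anchor != null && url.toString().contains("at_download/file")) {
            // NOTE: blindly trusts the anchor as a filename; for real use I'd
            // need to sanitize slashes/spaces and handle duplicate names.
            FileOutputStream out = new FileOutputStream(outputDir + "/" + anchor);
            out.write(content.getContent());
            out.close();
          }
        }
        reader.close();
      }
    }
  }
}

If there's already something in SegmentReader (or elsewhere) that gets me most of the way there, I'd much rather use that than write and maintain the above.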
Thanks for the help, guys.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

