Here's a real use case too:

./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

That produces, as one of its outlinks:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download
outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann%

That's correct. However, this outlink doesn't seem to be picked up during the generate/fetch/crawl cycle, as it never shows up in my crawl. Nutch (and parse-tika) parse the URL just fine, because if I run ParserChecker directly against that URL, I see:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
contentType: application/pdf
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Watergate Summary Part 01 of 02
Outlinks: 2
  outlink: toUrl: Li:92 anchor:
  outlink: toUrl: u92.:n. anchor:
Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860 Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment; filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
[chipotle:local/nutch/framework] mattmann%

I'll keep digging.
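As a sanity check on the filter side, the accept rule from my regex-urlfilter.txt can be tested outside Nutch. Here's a minimal Python sketch (the pattern is the single accept rule discussed in this thread; Python's regex semantics match Java's for a pattern this simple):

```python
import re

# The accept rule from conf/regex-urlfilter.txt. In Nutch the leading '+'
# means "accept" and is not part of the regex, so only the pattern is tested.
# The unescaped dots in "vault.fbi.gov" are technically wildcards, but they
# still match the literal dots in the hostname.
accept_rule = re.compile(r"^http://([a-z0-9]*\.)*vault.fbi.gov/")

urls = [
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view",
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file",
]

for url in urls:
    status = "accepted" if accept_rule.search(url) else "rejected"
    print(f"{status}: {url}")
# Both URLs print "accepted: ..."
```

Since the pattern accepts the at_download URL in isolation, whatever drops those links is presumably happening elsewhere in the generate/fetch cycle rather than in the regex itself.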
I wonder if it's a regex thing. I commented out *everything* in my regex-urlfilter.txt besides:

+^http://([a-z0-9]*\.)*vault.fbi.gov/

It seems to get EVERYTHING on the site *but* these dang at_download URLs.

Cheers,
Chris

On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:

> OK, it didn't work again: here are the URLs from a full crawl cycle:
>
> http://pastebin.com/Jx3Ar6Md
>
> When run independently, where I seed it with an *at_download* URL,
> direct to the PDF, it parses the PDF. But when I run it like normal with topN 10 and
> depth 10, it doesn't pick them up.
>
> /me stumped
>
> I'll poke around in the code, but was just wondering if I was doing something wrong.
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>
>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>> how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml,
>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>
>> I'll report back on whether I'm able to grab the PDFs or not using 1.4...
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>
>>> *really* weird.
>>>
>>> With 1.4, even though I have my http.agent.name property set in conf/nutch-default.xml,
>>> it keeps telling me this when I try to crawl:
>>>
>>> Fetcher: No agents listed in 'http.agent.name' property.
>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Is nutch-default.xml not read by the crawl command in 1.4?
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>
>>>> Can you also try with trunk or 1.4? I get different output with parsechecker,
>>>> such as a proper title.
>>>>
>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> contentType: application/pdf
>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: success(1,0)
>>>> Title: Watergate Summary Part 02 of 02
>>>> Outlinks: 0
>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>>>
>>>>> Hey Markus,
>>>>>
>>>>> I set http.content.limit to -1, so it shouldn't have a limit.
>>>>>
>>>>> I'll try injecting that single URL and see if I can get it to download
>>>>> using separate commands and see what happens!
:-) >>>>> >>>>> Cheers, >>>>> Chris >>>>> >>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote: >>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file? >>>>>> Can you also check without merging segments? Or as a last resort, inject >>>>>> that single URL in an empty crawl db and do a single crawl cycle, >>>>>> preferably by using separate commands instead of the crawl command? >>>>>> >>>>>>> Hey Guys, >>>>>>> >>>>>>> I'm using Nutch 1.3, and trying to get it to crawl: >>>>>>> >>>>>>> http://vault.fbi.gov/ >>>>>>> >>>>>>> My regex-url filter diff is: >>>>>>> >>>>>>> # accept anything else >>>>>>> #+. >>>>>>> >>>>>>> +^http://([a-z0-9*\.)*vault.fbi.gov/ >>>>>>> >>>>>>> I'm trying to get it to parse PDFs like: >>>>>>> >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad/ file >>>>>>> >>>>>>> I see that my config ParserChecker lets me parse it OK: >>>>>>> >>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch >>>>>>> org.apache.nutch.parse.ParserChecker >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file fetching: >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file parsing: >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file contentType: application/pdf >>>>>>> --------- >>>>>>> Url >>>>>>> --------------- >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad/ file--------- ParseData >>>>>>> --------- >>>>>>> Version: 5 >>>>>>> Status: success(1,0) >>>>>>> Title: >>>>>>> Outlinks: 0 >>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT >>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT >>>>>>> Content-Disposition=attachment; filename="watergat2.pdf" >>>>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close >>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML 
>>>>>>> Cache-Control=max-age=604800
>>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>>
>>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>>>>> terms of plugin.includes, as it looks like parse-tika is included
>>>>>>> and handles the * contentType.
>>>>>>>
>>>>>>> If I merge the segments from my crawl, dump them, and then grep for
>>>>>>> URLs, I see the crawl getting to URLs like:
>>>>>>>
>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>
>>>>>>> but then never grabbing the PDF once it parses that page, or adding
>>>>>>> it to the outlinks, as I never see a:
>>>>>>>
>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>
>>>>>>> in the URL list.
>>>>>>>
>>>>>>> I'm running this command to crawl:
>>>>>>>
>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>
>>>>>>> Any idea what I'm doing wrong?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Senior Computer Scientist
>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>> Email: chris.a.mattm...@nasa.gov
>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
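On the http.agent.name problem discussed above, a sketch of the conventional fix, assuming a stock 1.4 layout: user settings belong in nutch-site.xml (in the conf/ directory that is actually on the classpath, i.e. runtime/local/conf for the local runtime) rather than in nutch-default.xml, which ships the defaults; values in nutch-site.xml override nutch-default.xml. The agent name below is a placeholder:

```xml
<?xml version="1.0"?>
<!-- runtime/local/conf/nutch-site.xml: user overrides win over nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder; use your own crawler's name -->
    <value>MyTestCrawler</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables the content-size limit, per the thread above -->
    <value>-1</value>
  </property>
</configuration>
```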