Here's a real use case too:

./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

That produces, as one of its outlinks:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download
outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann%

That's correct. However, this outlink doesn't seem to be picked up during the generate/fetch/crawl cycle, as it never shows up in my crawl. Nutch (and parse-tika) parse the URL just fine, because if I run ParserChecker directly against that URL, I see:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
contentType: application/pdf
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Watergate Summary Part 01 of 02
Outlinks: 2
  outlink: toUrl: Li:92 anchor:
  outlink: toUrl: u92.:n. anchor:
Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860 Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment; filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
[chipotle:local/nutch/framework] mattmann%

I'll keep digging.
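As a sanity check on the filter side, the accept rule from my regex-urlfilter.txt can be tested outside Nutch. Here's a minimal Python sketch (the pattern is the single accept rule discussed in this thread; Python's regex semantics match Java's for a pattern this simple):

```python
import re

# The accept rule from conf/regex-urlfilter.txt. In Nutch the leading '+'
# means "accept" and is not part of the regex, so only the pattern is tested.
# The unescaped dots in "vault.fbi.gov" are technically wildcards, but they
# still match the literal dots in the hostname.
accept_rule = re.compile(r"^http://([a-z0-9]*\.)*vault.fbi.gov/")

urls = [
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view",
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file",
]

for url in urls:
    status = "accepted" if accept_rule.search(url) else "rejected"
    print(f"{status}: {url}")
# Both URLs print "accepted: ..."
```

Since the pattern accepts the at_download URL in isolation, whatever drops those links is presumably happening elsewhere in the generate/fetch cycle rather than in the regex itself.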
I wonder if it's a regex thing. I commented out *everything* in my regex-urlfilter.txt besides:

+^http://([a-z0-9]*\.)*vault.fbi.gov/

It seems to get EVERYTHING on the site *but* these dang at_download URLs.

Cheers,
Chris

On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:

> OK, it didn't work again: here are the URLs from a full crawl cycle:
>
> http://pastebin.com/Jx3Ar6Md
>
> When run independently, where I seed it with an *at_download* URL,
> direct to the PDF, it parses the PDF. But when I run it like normal with topN 10 and
> depth 10, it doesn't pick them up.
>
> /me stumped
>
> I'll poke around in the code, but was just wondering if I was doing something wrong.
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>
>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>> how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml,
>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>
>> I'll report back on whether I'm able to grab the PDFs or not using 1.4...
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>
>>> *really* weird.
>>>
>>> With 1.4, even though I have my http.agent.name property set in conf/nutch-default.xml,
>>> it keeps telling me this when I try to crawl:
>>>
>>> Fetcher: No agents listed in 'http.agent.name' property.
>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Is nutch-default.xml not read by the crawl command in 1.4?
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>
>>>> Can you also try with trunk or 1.4? I get different output with parsechecker,
>>>> such as a proper title.
>>>>
>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> contentType: application/pdf
>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: success(1,0)
>>>> Title: Watergate Summary Part 02 of 02
>>>> Outlinks: 0
>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>>>
>>>>> Hey Markus,
>>>>>
>>>>> I set http.content.limit to -1, so it shouldn't have a limit.
>>>>>
>>>>> I'll try injecting that single URL and see if I can get it to download
>>>>> using separate commands and see what happens!
:-) >>>>> >>>>> Cheers, >>>>> Chris >>>>> >>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote: >>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file? >>>>>> Can you also check without merging segments? Or as a last resort, inject >>>>>> that single URL in an empty crawl db and do a single crawl cycle, >>>>>> preferably by using separate commands instead of the crawl command? >>>>>> >>>>>>> Hey Guys, >>>>>>> >>>>>>> I'm using Nutch 1.3, and trying to get it to crawl: >>>>>>> >>>>>>> http://vault.fbi.gov/ >>>>>>> >>>>>>> My regex-url filter diff is: >>>>>>> >>>>>>> # accept anything else >>>>>>> #+. >>>>>>> >>>>>>> +^http://([a-z0-9*\.)*vault.fbi.gov/ >>>>>>> >>>>>>> I'm trying to get it to parse PDFs like: >>>>>>> >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad/ file >>>>>>> >>>>>>> I see that my config ParserChecker lets me parse it OK: >>>>>>> >>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch >>>>>>> org.apache.nutch.parse.ParserChecker >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file fetching: >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file parsing: >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad /file contentType: application/pdf >>>>>>> --------- >>>>>>> Url >>>>>>> --------------- >>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_downlo >>>>>>> ad/ file--------- ParseData >>>>>>> --------- >>>>>>> Version: 5 >>>>>>> Status: success(1,0) >>>>>>> Title: >>>>>>> Outlinks: 0 >>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT >>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT >>>>>>> Content-Disposition=attachment; filename="watergat2.pdf" >>>>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close >>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML 
>>>>>>> Cache-Control=max-age=604800
>>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>>
>>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>>>>> terms of plugin.includes, as it looks like parse-tika is included
>>>>>>> and handles the * contentType.
>>>>>>>
>>>>>>> If I merge the segments from my crawl, dump them, and then grep for
>>>>>>> URLs, I see the crawl getting to URLs like:
>>>>>>>
>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>
>>>>>>> but then never grabbing the PDF once it parses that page, or adding
>>>>>>> it to the outlinks, as I never see a:
>>>>>>>
>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>
>>>>>>> in the URL list.
>>>>>>>
>>>>>>> I'm running this command to crawl:
>>>>>>>
>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>
>>>>>>> Any idea what I'm doing wrong?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Senior Computer Scientist
>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>> Email: chris.a.mattm...@nasa.gov
>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
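On the http.agent.name problem discussed above, a sketch of the conventional fix, assuming a stock 1.4 layout: user settings belong in nutch-site.xml (in the conf/ directory that is actually on the classpath, i.e. runtime/local/conf for the local runtime) rather than in nutch-default.xml, which ships the defaults; values in nutch-site.xml override nutch-default.xml. The agent name below is a placeholder:

```xml
<?xml version="1.0"?>
<!-- runtime/local/conf/nutch-site.xml: user overrides win over nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder; use your own crawler's name -->
    <value>MyTestCrawler</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables the content-size limit, per the thread above -->
    <value>-1</value>
  </property>
</configuration>
```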