[Nutch-general] Re: PDF support? Does crawl parse p

Piotr Kosiorowski Wed, 31 Aug 2005 13:09:39 -0700

Hello Diane,

There is a plugin to parse PDF files. You have to enable it in nutch-site.xml (just copy the entry from nutch-default.xml).

You have to change the plugin.includes property so that it includes the parse-pdf plugin:
[...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...]
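
As a rough sketch only, the overriding entry in nutch-site.xml might look like the fragment below. It assumes the usual property/name/value layout of nutch-default.xml. The [...] parts are placeholders, not literal text: they stand for the rest of the default value, which should be copied unchanged from nutch-default.xml of your Nutch version.

```xml
<!-- Sketch of an override in nutch-site.xml; not a complete file. -->
<!-- Replace each [...] with the matching part of the plugin.includes -->
<!-- value found in your own nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>[...]parse-(text|html|pdf)[...]</value>
</property>
```

Only the parse-(...) group changes; everything else in the value stays as it is in the default file.
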
Regards
Piotr


Diane Palla wrote:

Does Nutch have a way to parse pdf files, that is, "application/pdf" content type files?
I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make it parse pdf files?

When I do the crawl, I get this error with application/pdf files:
050831 145126 fetch okay, but can't parse <mainurl>/research/126900/126969/126969.pdf, reason: failed(2,203):Content-Type not text/html: application/pdf
If it's not possible, what future version of Nutch do developers expect to support application/pdf types and have such parsing of pdf files available?
Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]
Bryan Woliner <[EMAIL PROTECTED]>
08/23/2005 05:22 PM
Please respond to: [email protected]
To: [email protected]
cc:
Subject: Adding small batches of fetched URLs to a larger aggregate segment/index

Hi,
I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched URLs to a single aggregate segment/index. I have a couple of questions about doing this:
1. Is it possible to use a different regex.urlfilter.txt file for each site that I am crawling? If so, how would I do this?
2. If I have a very large segment that is indexed (my aggregate index) and I want to add another (much smaller) set of fetched URLs to this index, what is the best way to do this? It seems like merging the small and large segments and then re-indexing the whole thing would be very time consuming, especially if I wanted to add new small sets of fetched URLs frequently.
Thanks for any suggestions you have to offer,
Bryan




