Hi Martin,

Have you found a solution to your problem? I'm facing the same issue: I
want to crawl all links but index only specific content such as pdf or msword.
It seems the Nutch concept here is slightly different from what most people
assume. First of all, crawl-urlfilter.txt is no help, as it will filter
out pages which may contain (or link to) the typed content. Second,
nutch-site.xml is no help either, as the "parse" and "index" elements aren't
meant for the configuration we are looking for. "parse" defines which content
types are parsed, so we want to have html|pdf|msword there, but the "index"
element probably describes something other than the type of content, as
follows from its values: basic|more, which are not really well documented,
apart from recommendations to use them in particular situations. I'm afraid
the only way to figure this out is to look at the code. Anyway, if you find a
solution, please share. Thanks.
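
P.S. To make the goal concrete, here is a rough Python sketch of the
behaviour I'm after. It is not Nutch code and does not use any Nutch API:
the function names, the tuple layout and the MIME type sets are all made up
for illustration. The point is only that "follow links" and "index" are two
separate decisions per page.

```python
# Hypothetical sketch, NOT Nutch code: all names here are illustrative.

# Content types we want to end up in the index (assumed MIME types).
INDEXABLE_TYPES = {"application/pdf", "application/msword"}

# Content types we still have to parse, because they either get indexed
# or carry the outlinks that lead to indexable documents.
PARSEABLE_TYPES = INDEXABLE_TYPES | {"text/html"}


def normalize(content_type):
    """Drop parameters such as '; charset=utf-8' and lowercase the type."""
    return content_type.split(";", 1)[0].strip().lower()


def should_parse(content_type):
    """True if the page should be parsed and its outlinks followed."""
    return normalize(content_type) in PARSEABLE_TYPES


def should_index(content_type):
    """True if the page itself should be written to the index."""
    return normalize(content_type) in INDEXABLE_TYPES


def process(pages):
    """Split fetched pages into links to follow and URLs to index.

    pages: iterable of (url, content_type, outlinks) tuples.
    Returns a (urls_to_follow, urls_to_index) pair of lists.
    """
    to_follow, to_index = [], []
    for url, content_type, outlinks in pages:
        if should_parse(content_type):
            to_follow.extend(outlinks)
        if should_index(content_type):
            to_index.append(url)
    return to_follow, to_index
```

With this split, an HTML page is parsed and its links are followed, but only
the pdf and msword documents it leads to would be indexed. A URL filter
alone cannot express this, because it drops the HTML page before its links
are ever seen.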


Martin Kammerlander-2 wrote:
> 
> original post removed
> 

