Hi Martin,

Have you found a solution to your problem? I'm facing the same issue: I want to crawl all links but index only specific content types, such as pdf or msword.

It seems the Nutch concept here is slightly different from what most people expect. First of all, crawl-urlfilter.txt is no help, as it would filter out pages which may contain links to the typed content. Second, nutch-site.xml is no help either, as the "parse" and "index" elements aren't meant for the configuration we are looking for. "parse" defines which content types are parsed, so we want html|pdf|msword there, but the "index" element probably describes something other than the type of content, judging by its values: basic|more. These are not really well documented, apart from recommendations to use them in particular situations.

I'm afraid the only way to figure this out is to look at the code. Anyway, if you find a solution, please share.

Thanks.
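
To make the goal concrete: every fetched page should still be parsed so its links get followed, but only documents of certain content types should reach the index. Below is a rough standalone Java sketch of that per-document decision. It is not Nutch code and does not use any Nutch API; the class name ContentTypeGate, the method shouldIndex and the two MIME types in the allowlist are assumptions made up for illustration. Where such a check could hook into Nutch is exactly the open question.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Nutch: decides per document whether to index it.
public final class ContentTypeGate {

    // Assumed allowlist: the usual MIME types for PDF and MS Word documents.
    private static final Set<String> INDEXABLE =
            Set.of("application/pdf", "application/msword");

    private ContentTypeGate() {}

    /** Returns true when the given Content-Type value names an allowed type. */
    public static boolean shouldIndex(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return INDEXABLE.contains(base);
    }

    public static void main(String[] args) {
        // An HTML page would still be crawled for links, but not indexed.
        System.out.println(shouldIndex("text/html; charset=UTF-8"));
        // A PDF would be indexed.
        System.out.println(shouldIndex("Application/PDF"));
    }
}
```

The point of the sketch is only that the filter has to act on the content type at indexing time, not on the URL at crawl time, which is why crawl-urlfilter.txt cannot do the job.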

Martin Kammerlander-2 wrote:
>
> original post removed
>

--
View this message in context: http://www.nabble.com/indexing-only-special-documents-tp10994592p16265546.html
Sent from the Nutch - User mailing list archive at Nabble.com.
