Hello,

I also had a similar problem. My small fix was to
edit the parse-plugins.xml file. It contains this rule:

<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>

Just comment out this wildcard match, so that unrecognized types are
no longer handed to parse-text. You might also check the other rules
for further unwanted content types.
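
As a rough sketch, the relevant part of parse-plugins.xml might look
like the fragment below once the wildcard rule is disabled. The two
explicit rules for text/plain and text/html are only assumed examples
of mappings you may want to keep; the entries in your own file can
differ, so check them before copying anything.

```xml
<!-- Wildcard rule disabled, so content of unknown type is no longer
     passed to parse-text:
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
-->

<!-- Assumed examples of explicit rules to keep; adjust to your setup. -->
<mimeType name="text/plain">
  <plugin id="parse-text" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
```

With a setup like this, only content types that have an explicit rule
are given to a parser. Whether the other content is still fetched is
a separate matter; as far as I can tell this only affects parsing.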

I don't know if this is the best place for such a change,
but it worked for me.

With best regards,

Heiko Dietze

Eugen Kochuev wrote:
Any information on this? I really need to limit what Nutch indexes
(only textual formats, excluding CSS, JavaScript and other data not
meant for human readers).


Nutch is trying to crawl everything, including DLL, EXE and all
non-textual formats. How can I limit Nutch to only certain desired
content types? I know it's possible to do this by editing the urlfilter
plugin settings, but it's hard to predetermine all the possible
extensions, so this technique is unreliable.
Is it possible to limit the crawler to fetching only certain
content types, or at least to have only those indexed?


