Is there any information on this? I really need to limit what Nutch indexes to textual formats only, excluding CSS, JavaScript, and other data not meant for human readers.
> Nutch is trying to crawl everything, including DLL, EXE and all
> non-textual formats. How to limit nutch to only some desirable
> content-types? I know it's possible to do this by editing urlfilter
> plugin settings, but it's hard to predetermine all the possible
> extensions and this technique is unreliable.
> Is it possible to limit crawler to fetch only some definite
> content-types or at least have only them indexed?
