Add pattern for filtering .js in default url filters
----------------------------------------------------
Key: NUTCH-1043
URL: https://issues.apache.org/jira/browse/NUTCH-1043
Project: Nutch
Issue Type: Task
Affects Versions: 1.4, 2.0
Reporter: Julien Nioche
Priority: Minor
Fix For: 1.4, 2.0
The JavaScript parser is not used by default because it is extremely noisy.
However, the default URL filters do not filter out URLs ending in .js, and the
default parser (Tika) can't parse them. In a nutshell, we are fetching URLs that
we know can't be parsed.
I suggest that we add a regex to the default URL filters that rejects URLs
ending in .js. If people are interested in fetching and parsing .js files, they
can activate the JavaScript parser plugin in their conf and remove the regex
from the URL filters.
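
To illustrate the kind of rule being proposed, here is a minimal standalone
sketch in Java. It is not the Nutch filter implementation: the class name
JsUrlFilterSketch, the pattern \.js$ and the example URLs are all assumptions
made for illustration. The sketch returns null for a rejected URL and the URL
itself otherwise. In the regex URL filter configuration, a rejecting rule of
this kind would be written as a line that starts with "-" followed by the
regex.

```java
import java.util.regex.Pattern;

public class JsUrlFilterSketch {

    // Hypothetical pattern (an assumption, not the committed rule):
    // matches URLs that end in ".js", ignoring case.
    private static final Pattern JS_SUFFIX =
            Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    /** Returns the URL if it is kept, or null if it is filtered out. */
    public static String filter(String url) {
        return JS_SUFFIX.matcher(url).find() ? null : url;
    }

    public static void main(String[] args) {
        // Example URLs are made up for the demonstration.
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/static/app.js",
            "http://example.com/static/APP.JS",
            "http://example.com/data.json"
        };
        for (String url : urls) {
            String verdict = (filter(url) == null) ? "rejected" : "accepted";
            System.out.println(url + " -> " + verdict);
        }
    }
}
```

Because the pattern is anchored at the end of the URL, a script URL that
carries a query string (for example app.js?v=2) is not caught by this sketch.
Whether that case matters for the default filters is left open here.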