[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1043.
--------------------------------
Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
> Key: NUTCH-1043
> URL: https://issues.apache.org/jira/browse/NUTCH-1043
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4, nutchgora
> Reporter: Julien Nioche
> Priority: Minor
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1043.patch
>
>
> The JavaScript parser is not enabled by default because it is extremely noisy.
> However, the default URL filters do not filter out URLs ending in .js, and the
> default parser (Tika) can't parse them. In a nutshell, we are fetching URLs
> that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. People who are
> interested in fetching and parsing .js files can activate the plugin in
> their conf and remove the regex from the URL filters.
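
For illustration only, here is a minimal Python sketch of how a rule like the one proposed above might behave. It assumes the rule style of Nutch's regex-urlfilter.txt: a leading '-' rejects, a leading '+' accepts, and the first matching rule decides. The pattern itself and the names RULES, compile_rules and accepts are hypothetical. They are not taken from NUTCH-1043.patch, and Nutch itself evaluates these rules in Java.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt.
# The .js pattern is an assumption, not the contents of the attached patch.
RULES = [
    r"-\.(js|JS)$",  # reject URLs that end in .js
    r"+.",           # accept anything else
]


def compile_rules(rules):
    """Split each rule into (accept?, compiled regex)."""
    return [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]


def accepts(url, compiled):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, pattern in compiled:
        if pattern.search(url):
            return accept
    return False  # no rule matched, so the URL is ignored


COMPILED = compile_rules(RULES)

for url in ("http://example.com/static/app.js",
            "http://example.com/index.html"):
    print(url, accepts(url, COMPILED))
```

Because the assumed pattern is anchored with $, a URL such as http://example.com/app.js?v=2 would still pass this filter. Whether that matters depends on the crawl.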
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira