[ 
https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064095#comment-13064095
 ] 

Markus Jelsma commented on NUTCH-1043:
--------------------------------------

Good reasoning. The suffix URL filter is better suited to handling large 
numbers of prohibited suffixes. Perhaps only the suffixes of formats known to 
produce large files (tif, ram, etc.) should be added to the default regex 
filter, to prevent users from accidentally downloading many MiB of unwanted 
(and unparsable) data.
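
To illustrate how a reject rule of this kind behaves, here is a minimal 
standalone sketch in Python. It is not Nutch code: the RULES list, the accept() 
function and the suffix list (js, tif, ram) are assumptions made up for this 
example, and the case-insensitive match is a simplification. It assumes the 
first matching rule decides and that a URL matching no rule is rejected.

```python
import re

# Hypothetical rules in the style of a regex URL filter file:
# "-" rejects URLs matching the pattern, "+" accepts them.
# The suffix list is an assumption taken from this thread, not a real default.
RULES = [
    ("-", re.compile(r"\.(js|tif|ram)$", re.IGNORECASE)),
    ("+", re.compile(r".")),  # accept anything else
]


def accept(url):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is rejected


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/lib/jquery.js",
        "http://example.com/scan.TIF",
        "http://example.com/app.js?v=2",
    ):
        print(url, "accepted" if accept(url) else "rejected")
```

Because the pattern is anchored with $, a URL such as app.js?v=2 is not caught 
by this rule, which is one reason to keep the suffix list in the regex filter 
short and rely on the suffix filter for broader coverage.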

> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>
> The JavaScript parser is not used by default because it is extremely noisy. 
> However, the default URL filters do not filter out URLs ending in .js, and 
> the default parser (Tika) can't parse them. In a nutshell, we are fetching 
> URLs that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. If people are 
> interested in fetching and parsing .js files they can activate the plugin in 
> their conf and remove the regex in the URL filters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
