Add pattern for filtering .js in default url filters
----------------------------------------------------

                 Key: NUTCH-1043
                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
             Project: Nutch
          Issue Type: Task
    Affects Versions: 1.4, 2.0
            Reporter: Julien Nioche
            Priority: Minor
             Fix For: 1.4, 2.0


The JavaScript parser is not enabled by default because it is extremely noisy. 
However, the default URL filters do not filter out URLs ending in .js, and the 
default parser (Tika) can't parse them. In a nutshell, we are fetching URLs that 
we know can't be parsed.
I suggest that we add a regex to the default URL filters that rejects these 
URLs. If people are interested in fetching and parsing .js files, they can 
activate the plugin in their conf and remove the regex from the URL filters.
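
To illustrate the kind of rule I have in mind, here is a minimal standalone 
sketch in plain Java. It does not use the Nutch filter API: the class name, the 
accept() method and the exact pattern are all assumptions made for 
illustration, and the regex that ends up in the default filters may well look 
different.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a "reject URLs ending in .js" rule.
// Names and pattern are illustrative assumptions, not Nutch code or config.
class JsUrlFilterSketch {

    // Matches any URL whose last characters are ".js", in either case.
    private static final Pattern JS_SUFFIX =
            Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    // Returns true if the URL should be kept, false if it should be dropped.
    static boolean accept(String url) {
        return !JS_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/scripts/app.js",
            "http://example.com/data.json"
        };
        for (String url : urls) {
            // Prefix kept URLs with "+" and dropped URLs with "-".
            System.out.println((accept(url) ? "+ " : "- ") + url);
        }
    }
}
```

Anchoring the pattern with $ means that only URLs actually ending in .js are 
dropped, so something like data.json is still fetched. A URL with a query 
string after the .js suffix would not be caught by this particular pattern.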

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
