[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067481#comment-13067481 ]

Hudson commented on NUTCH-1043:
-------------------------------

Integrated in Nutch-trunk #1550 (See 
[https://builds.apache.org/job/Nutch-trunk/1550/])
    NUTCH-1043 Add pattern for filtering .js in default url filters

jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147798
Files : 
* /nutch/trunk/conf/automaton-urlfilter.txt.template
* /nutch/trunk/conf/regex-urlfilter.txt.template
* /nutch/trunk/CHANGES.txt


> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1043.patch
>
>
> The JavaScript parser is not used by default because it is extremely noisy. 
> However, the default URL filters do not filter out URLs ending in .js, and the 
> default parser (Tika) can't parse them. In a nutshell, we are fetching URLs 
> that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. People who are 
> interested in fetching and parsing .js files can activate the plugin in 
> their conf and remove the regex from the URL filters.
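
To illustrate the effect of the proposed rule, here is a minimal Python sketch of first-match URL filtering in the style of regex-urlfilter.txt. The pattern `\.(js|JS)$`, the `RULES` list, and the `accept` function are assumptions made for illustration; the exact pattern committed for this issue is not shown in this message, and this is not Nutch's actual implementation.

```python
import re

# Hypothetical rules modelled on regex-urlfilter.txt conventions:
# "-" rejects URLs matching the pattern, "+" accepts them.
# The .js pattern below is an assumed example, not the committed one.
RULES = [
    ("-", re.compile(r"\.(js|JS)$")),  # assumed rule: skip URLs ending in .js
    ("+", re.compile(r".")),           # accept anything else
]


def accept(url: str) -> bool:
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


if __name__ == "__main__":
    for candidate in ("http://example.com/index.html",
                      "http://example.com/static/app.js"):
        print(candidate, "->", "fetch" if accept(candidate) else "skip")
```

Because the assumed pattern is anchored at the end of the URL, a script URL with a query string (for example `app.js?v=2`) would still pass this sketch. Removing the reject rule restores fetching of .js files, which matches the opt-in path described above.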

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
