[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067481#comment-13067481 ]
Hudson commented on NUTCH-1043:
-------------------------------

Integrated in Nutch-trunk #1550 (See [https://builds.apache.org/job/Nutch-trunk/1550/])
    NUTCH-1043 Add pattern for filtering .js in default url filters

jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147798
Files :
* /nutch/trunk/conf/automaton-urlfilter.txt.template
* /nutch/trunk/conf/regex-urlfilter.txt.template
* /nutch/trunk/CHANGES.txt

> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1043.patch
>
>
> The Javascript parser is not used by default as it is extremely noisy,
> however the default URL filters do not filter out URLs ending in .js and the
> default parser (Tika) can't parse them. In a nutshell we are fetching URLS
> that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. If people are
> interested in fetching and parsing .js files they can activate the plugin in
> their conf and remove the regex in the URL filters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
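
For readers unfamiliar with how a rule of this kind behaves, here is a minimal Java sketch of a sign-prefixed regex URL filter. It is not the Nutch plugin code and not the text of the committed patch. The class name JsUrlFilterSketch, the two rule strings, and the "first matching rule decides, no match means reject" semantics are assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: a tiny sign-prefixed regex URL filter.
// Rule strings and evaluation order are assumptions, not the NUTCH-1043 patch.
public class JsUrlFilterSketch {

    // A rule is a regex plus a sign: '+' accepts, '-' rejects.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Hypothetical rule set: reject URLs ending in .js, accept everything else.
    static final List<Rule> RULES = List.of(
        new Rule("-\\.(js|JS)$"),
        new Rule("+.")
    );

    // The first rule whose pattern is found anywhere in the URL decides.
    // If no rule matches, the URL is rejected.
    static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/app.js",
            "http://example.com/index.html",
            "http://example.com/data.json",
            "http://example.com/app.js?v=2"
        };
        for (String url : urls) {
            System.out.println((accepts(url) ? "keep  " : "skip  ") + url);
        }
    }
}
```

Because the pattern is anchored at the end of the URL, a script URL carrying a query string (the last sample above) is not caught by this rule. Someone who wants to fetch and parse .js files would remove the reject rule and enable the JavaScript parser plugin in their configuration, as the issue description suggests.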