[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1281:
----------------------------------------
    Fix Version/s: 2.2
                   1.7

> tika parser not work properly with unwanted file types that passed from filters in nutch
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: behnam nikbakht
>             Fix For: 1.7, 2.2
>
>
> When parse-plugins.xml contains this mapping:
>   <mimeType name="*">
>     <plugin id="parse-tika" />
>   </mimeType>
> every file that gets through all of the filters is referred to Tika, whether it is wanted or not.
> For some file types, such as .flv, the Tika parser hangs and causes the parse job to fail.
> So if files of these types pass regex-urlfilter and the other filters, the parse job fails.
> To address this, I suggest adding a property that lists the valid file types and checking it
> in TikaParser.java, like this:
>   public ParseResult getParse(Content content) {
>     String mimeType = content.getContentType();
> +    // MIME types that parse-tika is allowed to handle
> +    String[] validTypes = new String[] {
> +        "application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
> +        "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
> +        "application/rss+xml", "application/x-bzip2", "application/x-gzip",
> +        "application/x-javascript", "application/javascript", "text/javascript",
> +        "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
> +    boolean valid = false;
> +    for (int k = 0; k < validTypes.length; k++) {
> +      if (validTypes[k].equals(mimeType.toLowerCase())) {
> +        valid = true;
> +        break;
> +      }
> +    }
> +    // Skip anything that is not on the list instead of handing it to Tika
> +    if (!valid)
> +      return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype "
> +          + mimeType).getEmptyParseResult(content.getUrl(), getConf());
>
>     URL base;
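
The issue proposes moving the list of valid file types into a property rather than hard-coding it. A minimal, standalone sketch of that idea follows. It is not the patch attached to the issue and it does not use any Nutch or Tika classes: the class name MimeTypeAllowList is hypothetical, and the comma-separated string stands in for whatever property value a real change would read from the configuration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: a configurable MIME-type allow-list (not part of Nutch). */
public class MimeTypeAllowList {

  private final Set<String> allowed;

  /** Builds the allow-list from a comma-separated value, e.g. one read from a config property. */
  public MimeTypeAllowList(String commaSeparatedTypes) {
    Set<String> types = new HashSet<>();
    if (commaSeparatedTypes != null) {
      for (String type : commaSeparatedTypes.split(",")) {
        String normalized = normalize(type);
        if (!normalized.isEmpty()) {
          types.add(normalized);
        }
      }
    }
    this.allowed = Collections.unmodifiableSet(types);
  }

  /** Returns true only if the content type (parameters ignored) is on the list; null is rejected. */
  public boolean isAllowed(String contentType) {
    return contentType != null && allowed.contains(normalize(contentType));
  }

  /** Trims, lower-cases, and drops parameters such as "; charset=UTF-8". */
  private static String normalize(String contentType) {
    int semicolon = contentType.indexOf(';');
    String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    MimeTypeAllowList list =
        new MimeTypeAllowList("application/pdf, text/plain,application/zip");
    System.out.println(list.isAllowed("Application/PDF"));           // true
    System.out.println(list.isAllowed("text/plain; charset=UTF-8")); // true
    System.out.println(list.isAllowed("video/x-flv"));               // false
    System.out.println(list.isAllowed(null));                        // false
  }
}
```

A set lookup replaces the linear scan over the array, and a null content type is rejected instead of triggering a NullPointerException on toLowerCase(). In a parser, a false result would correspond to returning the NOTPARSED status described in the issue.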