Tika parser does not work properly with unwanted file types that pass the
filters in Nutch
----------------------------------------------------------------------------------------
Key: NUTCH-1281
URL: https://issues.apache.org/jira/browse/NUTCH-1281
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: behnam nikbakht
When parse-plugins.xml contains this catch-all mapping:
<mimeType name="*">
<plugin id="parse-tika" />
</mimeType>
all unwanted files that get past every filter are handed to Tika.
For some file types, such as .flv, the Tika parser hangs, which causes the
parse job to fail.
So whenever such file types get through regex-urlfilter and the other filters,
the parse job fails.
To address this, I suggest adding a property that lists the valid file types
and checking against it in TikaParser.java, like this:
public ParseResult getParse(Content content) {
String mimeType = content.getContentType();
+ // MIME types that parse-tika is allowed to handle; all others are skipped.
+ String[] validTypes = new String[] {
+   "application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
+   "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
+   "application/rss+xml", "application/x-bzip2", "application/x-gzip",
+   "application/x-javascript", "application/javascript", "text/javascript",
+   "application/x-shockwave-flash", "application/zip", "text/xml",
+   "application/xml"
+ };
+ boolean valid = false;
+ for (int k = 0; k < validTypes.length; k++) {
+   // equalsIgnoreCase also copes with a null mimeType (no NullPointerException)
+   if (validTypes[k].equalsIgnoreCase(mimeType)) {
+     valid = true;
+     break;
+   }
+ }
+ if (!valid)
+   return new ParseStatus(ParseStatus.NOTPARSED,
+       "Can't parse for unwanted filetype " + mimeType)
+       .getEmptyParseResult(content.getUrl(), getConf());
URL base;
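
The patch above hard-codes the list. As a rough sketch of the "properties
for valid file types" part of the suggestion, the check could be driven by a
comma-separated property value instead. The class name MimeTypeAllowlist and
the idea of a single comma-separated property are assumptions made for
illustration only; neither exists in Nutch or Tika. Stripping parameters such
as "; charset=utf-8" before comparing is likewise an added assumption that the
original patch does not make.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch or Tika. */
final class MimeTypeAllowlist {
    private final Set<String> allowed;

    private MimeTypeAllowlist(Set<String> allowed) {
        this.allowed = Collections.unmodifiableSet(allowed);
    }

    /** Builds the allowlist from a comma-separated property value. */
    static MimeTypeAllowlist fromProperty(String value) {
        Set<String> types = new HashSet<>();
        if (value != null) {
            for (String token : value.split(",")) {
                String type = normalize(token);
                if (!type.isEmpty()) {
                    types.add(type);
                }
            }
        }
        return new MimeTypeAllowlist(types);
    }

    /** Lower-cases the type and drops parameters such as "; charset=utf-8". */
    private static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true only if the (normalized) type is on the list. */
    boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }
}
```

With a property value of "application/pdf, text/plain", isAllowed accepts
"Application/PDF" and rejects "video/x-flv" as well as a null content type.
In getParse the value would presumably be read from the job configuration,
and the NOTPARSED status returned whenever isAllowed is false.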