[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1281:
----------------------------------------

    Fix Version/s: 2.2
                   1.7
    
> tika parser not work properly with unwanted file types that passed from 
> filters in nutch
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: behnam nikbakht
>             Fix For: 1.7, 2.2
>
>
> When parse-plugins.xml contains this mapping:
> <mimeType name="*">
>         <plugin id="parse-tika" />
> </mimeType>
> all unwanted files that pass through every filter are referred to Tika.
> For some file types, such as .flv, the Tika parser hangs, which causes the
> parse job to fail.
> So whenever files of these types get past regex-urlfilter and the other
> filters, the parse job fails.
> To address this, I suggest adding properties that list the valid file types
> and checking them in TikaParser.java, like this:
> public ParseResult getParse(Content content) {
>               String mimeType = content.getContentType();
> +             String[] validTypes = new String[] {
> +                     "application/pdf", "application/x-tika-msoffice",
> +                     "application/x-tika-ooxml",
> +                     "application/vnd.oasis.opendocument.text", "text/plain",
> +                     "application/rtf", "application/rss+xml",
> +                     "application/x-bzip2", "application/x-gzip",
> +                     "application/x-javascript", "application/javascript",
> +                     "text/javascript", "application/x-shockwave-flash",
> +                     "application/zip", "text/xml", "application/xml"};
> +             boolean valid = false;
> +             for (int k = 0; k < validTypes.length; k++) {
> +                     if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
> +                             valid = true;
> +             }
> +             if (!valid)
> +                     return new ParseStatus(ParseStatus.NOTPARSED,
> +                                     "Can't parse for unwanted filetype " + mimeType)
> +                                     .getEmptyParseResult(content.getUrl(), getConf());
>
>               URL base;
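
The description suggests moving the valid file types into properties rather
than hard-coding them. A minimal sketch of that idea follows, assuming the
list arrives as one comma-separated string (for example from a hypothetical
property such as parser.tika.valid.mimetypes, which is not an existing Nutch
setting). The class name MimeTypeWhitelist and its methods are illustrative
only and not part of Nutch or Tika. Unlike the snippet in the description,
it treats a null content type as invalid and ignores parameters such as
"; charset=UTF-8"; both are assumptions about the desired behavior.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): MIME types allowed to reach the Tika parser. */
final class MimeTypeWhitelist {

    private final Set<String> validTypes;

    /** Builds the whitelist from a comma-separated property value; null or empty means "allow nothing". */
    MimeTypeWhitelist(String commaSeparatedTypes) {
        Set<String> types = new HashSet<>();
        if (commaSeparatedTypes != null) {
            for (String type : commaSeparatedTypes.split(",")) {
                // Normalize so that lookups are case-insensitive and whitespace-tolerant.
                String normalized = type.trim().toLowerCase(Locale.ROOT);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.validTypes = Collections.unmodifiableSet(types);
    }

    /** Returns true only if the MIME type is non-null and on the list. */
    boolean isValid(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Assumption: drop any parameters after ';' before comparing.
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return validTypes.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

With such a helper, the check in getParse could shrink to a single
isValid(mimeType) call ahead of the existing NOTPARSED return, and the set
lookup would replace the loop over the array.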

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
