[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512858#comment-14512858
 ] 

Chris A. Mattmann commented on NUTCH-1994:
------------------------------------------

Hey [~jorgelbg], I thought it was NUTCH-1991, but that appears to be a red 
herring. This first appeared with the commit of NUTCH-1994. I have been 
working on this all day to figure out whether it was due to NUTCH-1991, and 
it seems that it wasn't. 

I'm down to this error in parse-zip (excuse my System.out.printlns):

{noformat}
2015-04-25 20:31:54,378 INFO  conf.Configuration (Configuration.java:getConfResourceAsInputStream(1017)) - found resource parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,408 INFO  conf.Configuration (Configuration.java:getConfResourceAsInputStream(1017)) - found resource parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,414 INFO  parse.ParserFactory (ParserFactory.java:matchExtensions(376)) - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it  in the parse-plugins.xml file
PARSER RETRIEVED! NULL!
2015-04-25 20:31:54,473 ERROR tika.TikaParser (TikaParser.java:getParse(86)) - Can't retrieve Tika parser for mime-type text/plain
RESULT TEXT! textfile.txt  
HERE IS THE PARSE TEXT textfile.txt  
{noformat}

So it looks like getParse in TikaParser.java can't retrieve the Tika 
parser for text/plain (the zip file in the sample directory for parse-zip 
contains a single text file, textfile.txt, which contains the expected text). 
Since the appropriate Tika parser can't be retrieved, the parser extracts only 
the filename and not the text, hence the test is failing.
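
For illustration, the failure mode described above can be sketched roughly as 
follows. This is not the parse-zip or TikaParser source: the class 
ZipFallbackSketch, the parsers map, and guessType are hypothetical stand-ins 
that use only the JDK. The point is just that the entry name is always 
emitted, while the entry text appears only if a parser is found for the 
entry's mime type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not Nutch code: shows why a missing per-entry parser
// leaves only the filename in the extracted text.
class ZipFallbackSketch {

    // Assumption: a naive extension-based guess stands in for real mime detection.
    static String guessType(String name) {
        return name.endsWith(".txt") ? "text/plain" : "application/octet-stream";
    }

    // Walks the archive; "parsers" is a hypothetical mime-type -> parser registry.
    static String extract(byte[] zipBytes, Map<String, Function<byte[], String>> parsers) {
        StringBuilder text = new StringBuilder();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // The entry name is always appended.
                text.append(entry.getName()).append(' ');
                byte[] body = readFully(zin);
                Function<byte[], String> parser = parsers.get(guessType(entry.getName()));
                if (parser != null) {
                    // The entry text is appended only when a parser was found.
                    text.append(parser.apply(body)).append(' ');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Reads the current zip entry to its end.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Builds a one-entry zip in memory so the sketch needs no files on disk.
    static byte[] sampleZip(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip("textfile.txt", "hello from the archive");
        Map<String, Function<byte[], String>> parsers = new HashMap<>();
        // No parser registered for text/plain: only the filename comes back.
        System.out.println("[" + extract(zip, parsers) + "]");
        // With a text/plain parser registered, the entry text comes back too.
        parsers.put("text/plain", b -> new String(b, StandardCharsets.UTF_8));
        System.out.println("[" + extract(zip, parsers) + "]");
    }
}
```

With an empty registry the sketch returns just the entry name, which is the 
same shape as the "textfile.txt" result in the log above.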

I'm still trying to figure out why it can't find the Tika 1.8 parser for 
text/plain.
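
One thing that may be worth checking while debugging is the ParserFactory 
warning in the log, which says text/plain is not mapped in parse-plugins.xml. 
A rough JDK-only sketch for listing the plugin ids mapped to a mime type is 
below. The element and attribute names (mimeType/name, plugin/id) are my 
assumption about the parse-plugins.xml layout, the XML string is a made-up 
fragment rather than the real file, and ParsePluginsCheck is a hypothetical 
helper. It only does exact-name matches and ignores any wildcard entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: lists plugin ids mapped to a mime type.
class ParsePluginsCheck {

    // Assumes <mimeType name="..."> elements containing <plugin id="..."/> children.
    static List<String> pluginsFor(String xml, String mimeType) {
        List<String> ids = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList mimeTypes = doc.getElementsByTagName("mimeType");
            for (int i = 0; i < mimeTypes.getLength(); i++) {
                Element mt = (Element) mimeTypes.item(i);
                if (!mimeType.equals(mt.getAttribute("name"))) {
                    continue; // exact match only; wildcard entries are not considered
                }
                NodeList plugins = mt.getElementsByTagName("plugin");
                for (int j = 0; j < plugins.getLength(); j++) {
                    ids.add(((Element) plugins.item(j)).getAttribute("id"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read mapping XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration; not the contents of the real conf file.
        String xml = "<parse-plugins>"
                + "<mimeType name=\"application/zip\"><plugin id=\"parse-zip\"/></mimeType>"
                + "</parse-plugins>";
        System.out.println("application/zip -> " + pluginsFor(xml, "application/zip"));
        System.out.println("text/plain -> " + pluginsFor(xml, "text/plain"));
    }
}
```

An empty list for text/plain would match the warning, but it would not by 
itself explain why the Tika parser lookup returns null.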

> Upgrade to Apache Tika 1.8
> --------------------------
>
>                 Key: NUTCH-1994
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1994
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build, parser
>    Affects Versions: 1.10, 2.3.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.10, 2.3.1
>
>         Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch
>
>
> Tika 1.8 was released this morning.
> Let's upgrade, then release Nutch trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
