[
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512858#comment-14512858
]
Chris A. Mattmann commented on NUTCH-1994:
------------------------------------------
Hey [~jorgelbg], I thought the cause was NUTCH-1991, but that appears to be a
red herring. The failure first appeared with the commit of NUTCH-1994. I have
been working on this all day to figure out whether it was due to NUTCH-1991,
and it seems that it wasn't.
I'm down to this error in parse-zip (excuse my System.out.printlns):
{noformat}
2015-04-25 20:31:54,378 INFO conf.Configuration
(Configuration.java:getConfResourceAsInputStream(1017)) - found resource
parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,408 INFO conf.Configuration
(Configuration.java:getConfResourceAsInputStream(1017)) - found resource
parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,414 INFO parse.ParserFactory
(ParserFactory.java:matchExtensions(376)) - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes
system property, and all claim to support the content type text/plain, but they
are not mapped to it in the parse-plugins.xml file
PARSER RETRIEVED! NULL!
2015-04-25 20:31:54,473 ERROR tika.TikaParser (TikaParser.java:getParse(86)) -
Can't retrieve Tika parser for mime-type text/plain
RESULT TEXT! textfile.txt
HERE IS THE PARSE TEXT textfile.txt
{noformat}
So it looks like getParse in TikaParser.java can't retrieve the Tika
parser for text/plain (the zip file in the sample directory for parse-zip
contains a single text file, textfile.txt, which contains the expected text).
Since the appropriate Tika parser can't be retrieved, the parser extracts only
the filename and not the text, hence the test is failing.
I'm still trying to figure out why the Tika parser for text/plain can't be
found with Tika 1.8.
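To make the failure mode concrete, here is a minimal, self-contained sketch of
the behaviour described above. It is not the Nutch or Tika code: the class
ZipParseSketch, its methods, the parser map, and the extension-based mime type
guess are all hypothetical stand-ins. It only illustrates that when no parser is
registered for text/plain, the output contains the entry name and nothing else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical, simplified model of the failure; not Nutch's parse-zip code.
class ZipParseSketch {

    // Builds an in-memory zip with a single entry, like the sample archive.
    static byte[] sampleZip(String name, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads the remaining bytes of the current zip entry.
    static byte[] readEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = zip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Emits each entry name, followed by the entry text only when a parser
    // is registered for the entry's mime type. No parser means name only.
    static String parse(byte[] zipBytes,
                        Map<String, Function<byte[], String>> parsers)
            throws IOException {
        StringBuilder result = new StringBuilder();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (result.length() > 0) {
                    result.append('\n');
                }
                result.append(entry.getName());
                // Assumption: the mime type is guessed from the extension.
                String mime = entry.getName().endsWith(".txt")
                        ? "text/plain" : "application/octet-stream";
                Function<byte[], String> parser = parsers.get(mime);
                if (parser != null) {
                    result.append(' ').append(parser.apply(readEntry(zip)));
                }
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip("textfile.txt", "expected text");
        Map<String, Function<byte[], String>> parsers = new HashMap<>();

        // No parser for text/plain: only the filename comes back.
        System.out.println(parse(zip, parsers));

        // With a parser registered, the text is extracted as well.
        parsers.put("text/plain", b -> new String(b, StandardCharsets.UTF_8));
        System.out.println(parse(zip, parsers));
    }
}
```

In this model the first call returns only "textfile.txt", which mirrors the
"RESULT TEXT! textfile.txt" line in the log, and the second call returns the
filename followed by the file's text, which is what the test expects.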
> Upgrade to Apache Tika 1.8
> --------------------------
>
> Key: NUTCH-1994
> URL: https://issues.apache.org/jira/browse/NUTCH-1994
> Project: Nutch
> Issue Type: Improvement
> Components: build, parser
> Affects Versions: 1.10, 2.3.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch
>
>
> Tika 1.8 was released this morning.
> Let's upgrade, then release Nutch trunk.