[jira] [Comment Edited] (NUTCH-1925) Upgrade Tika to version 1.7

Tyler Palsulich (JIRA) Tue, 17 Feb 2015 12:26:07 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324669#comment-14324669 ]


Tyler Palsulich edited comment on NUTCH-1925 at 2/17/15 8:23 PM:
-----------------------------------------------------------------

Reopening since this is causing a test failure in the 2.x branch:
{code}
java.lang.NullPointerException
        at org.apache.nutch.parse.tika.TestImageMetadata.testIt(TestImageMetadata.java:73)
{code}

The relevant lines of the test are:
{code}
      parse = new ParseUtil(conf).parse(urlString, page);
      ByteBuffer bbufW = page.getMetadata().get(new Utf8("width"));
      byte[] byteArrayW = new byte[bbufW.remaining()];  // <-- NPE
{code}

{{page.getMetadata().keySet()}} does not contain "width" or "height". However, 
both are extracted when running Tika directly (and on the 1.x branch).
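
To make the failure mode concrete, here is a minimal, self-contained sketch of the same lookup pattern. It is not the Nutch test itself: the class name, the {{readValue}} helper, and the plain {{Map<String, ByteBuffer>}} standing in for {{page.getMetadata()}} are all assumptions for illustration. It only shows that a missing key yields a null {{ByteBuffer}}, which is what {{bbufW.remaining()}} then dereferences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MetadataLookupSketch {

    // Hypothetical helper (not part of Nutch): returns the value stored under
    // key as a UTF-8 string, or null when the key is absent, instead of
    // failing with a NullPointerException on buf.remaining().
    static String readValue(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key); // null if no parser set this key
        if (buf == null) {
            return null;
        }
        byte[] bytes = new byte[buf.remaining()];
        // Read through a duplicate so the stored buffer's position is unchanged.
        buf.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed stand-in for page.getMetadata() when the expected keys
        // were never written; the key and value here are placeholders.
        Map<String, ByteBuffer> metadata = new HashMap<>();
        metadata.put("some-other-key",
                ByteBuffer.wrap("value".getBytes(StandardCharsets.UTF_8)));

        // Listing the keys first shows what was actually extracted.
        System.out.println("keys: " + metadata.keySet());
        System.out.println("width: " + readValue(metadata, "width"));
    }
}
```

A guard like this would only turn the NPE into a clearer assertion failure; it would not make the missing keys appear.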

I'm investigating why right now, but I seem to be going in circles.

Edit: Okay, so for some reason Tika's new GDAL Parser is being selected as 
the parser to use. That's why the metadata has keys like "Coordinate System". 
We really want the ImageParser to be selected. I'm not sure why Tika selects a 
different parser on the two Nutch branches. Nutch must call Tika slightly 
differently on each, leading to different (probably unintentional) 
behavior from Tika.

I'm going to keep investigating, but wanted to give an update.

Edit 2: The GribParser is selected when the parser is looked up through the deprecated 
[TikaConfig#getParser(MediaType)|https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L237]
 method. If you just create a new AutoDetectParser and ignore the MIME type 
hint, the correct underlying parser is selected and the test passes.

This might break a Nutch semantic I'm not aware of, but a patch is coming up 
for the 2.x branch that ignores the MIME type hint and just creates an 
AutoDetectParser.


> Upgrade Tika to version 1.7
> ---------------------------
>
>                 Key: NUTCH-1925
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1925
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>            Reporter: Tyler Palsulich
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.10, 2.3.1
>
>         Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.patch, 
> NUTCH-1925.palsulich.v2.patch
>
>
> Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
> API changes between 1.6 and 1.7, so this should be a one-line update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
