[ https://issues.apache.org/jira/browse/TIKA-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912666#action_12912666 ]
Nick Burch commented on TIKA-385:
---------------------------------
This support has been added as part of TIKA-506. If you apply the patch from
that issue and use an svn build of POI, you'll see your hyperlinks in the
correct place!
The POI 3.7 beta 3 release vote will be starting shortly, and if it passes, the
release will be out Friday, so with any luck the patch from TIKA-506 will be
applied at the end of this week.
> Incorrect handling of hyperlinks in .docx
> -----------------------------------------
>
> Key: TIKA-385
> URL: https://issues.apache.org/jira/browse/TIKA-385
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.6
> Environment: Linux, java version "1.6.0_17", Java(TM) SE Runtime
> Environment (build 1.6.0_17-b04)
> Reporter: Liam O'Boyle
> Attachments: Internal_Search_Test.docx
>
>
> Hyperlinks are incorrectly parsed in at least some Office 2007 Word files.
> The attached file is one example.
> There are two problems with the handling:
> - an incorrectly formatted link is generated: instead of
> <a href="http://somewhere"> you get <http://somewhere>
> - the link is in the incorrect location in the extracted text; the links in
> the attached document end up at the end of the paragraph that they were
> originally in the middle of
> Both of these issues cause problems later on when using Tika with Solr:
> - the incorrect links are not picked up by the built-in HTML filter classes
> - the garbled text order creates unhelpful snippets when highlighting
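
The first symptom in the report can be checked mechanically. The following is only a rough sketch, assuming the extracted output is available as a string and that the malformed links appear literally as <http://...>. The function name and the two sample strings are hypothetical illustrations, not Tika output and not part of the Tika API.

```python
import re

# Malformed form from the report: a bare URL in angle brackets, e.g. <http://somewhere>
BARE_LINK = re.compile(r'<(https?://[^\s<>"]+)>')
# Expected form: a proper anchor element, e.g. <a href="http://somewhere">
ANCHOR = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)


def classify_links(extracted):
    """Return (anchor_urls, bare_urls) found in the extracted output."""
    return ANCHOR.findall(extracted), BARE_LINK.findall(extracted)


# Hypothetical samples: what the reporter expected versus the shape described
# in the report (bare link, moved to the end of the paragraph).
expected = '<p>See <a href="http://somewhere">the page</a> for details.</p>'
observed = '<p>See the page for details. <http://somewhere></p>'

if __name__ == "__main__":
    print(classify_links(expected))
    print(classify_links(observed))
```

Any URL reported in the second list is one that an HTML filter looking for anchor elements would miss.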