[
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495
]
Moritz Dorka edited comment on TIKA-1315 at 9/22/14 8:14 AM:
-------------------------------------------------------------
Hmm, apparently, files are global to a bug in Jira and are not linked to
specific comments... Too bad. So this is related to [^ListManager.tar.bz2] and
[^ListNumbering.patch] which I propose as substitutes for Filip's work.
----
\\
The original patch proposed by Filip is quite good, but:
* it lacks true support for ListFormatOverrideLevels (which, admittedly, is a
really brain-twisting feature of Word),
* it does not cope correctly with bullets / unnumbered items (i.e. stuff which
has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists,
* it has no support for legal formatting, and
* it has no support for levels which restart at arbitrary more-significant
levels.
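To make the second and fourth points concrete, here is a minimal sketch of a multilevel counter. It is not the attached code and not the MS-DOC algorithm itself: the class name, the {{restartLimit}} array and the -1 return value are hypothetical, and it assumes that a bullet / unnumbered item still resets deeper levels without consuming a number of its own. Only the nfc values 0x17 and 0xFF are taken from the list above.

```java
/** Simplified, hypothetical model of multilevel list counters. */
final class MultiLevelCounterSketch {
    static final int NFC_BULLET = 0x17; // bullet item
    static final int NFC_NONE = 0xFF;   // item without a number

    private final int[] counters;     // current number per level, 0 = not started
    private final int[] restartLimit; // level i restarts after any level < restartLimit[i]

    MultiLevelCounterSketch(int[] restartLimit) {
        this.restartLimit = restartLimit.clone();
        this.counters = new int[restartLimit.length];
    }

    static boolean isUnnumbered(int nfc) {
        return nfc == NFC_BULLET || nfc == NFC_NONE;
    }

    /** Returns the number for a paragraph at this level, or -1 if it is unnumbered. */
    int next(int level, int nfc) {
        // Restart deeper levels, but only those configured to restart after this level.
        for (int deeper = level + 1; deeper < counters.length; deeper++) {
            if (level < restartLimit[deeper]) {
                counters[deeper] = 0;
            }
        }
        if (isUnnumbered(nfc)) {
            return -1; // assumption: bullets do not advance the counter
        }
        return ++counters[level];
    }
}
```

With {{restartLimit = {0, 1, 1}}}, the third level keeps counting across second-level items and only restarts after a top-level item, which is the kind of "restart at an arbitrary more-significant level" meant above.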
Attached is an improved version of the numbering algorithm, written from
scratch, with the exception of two helper methods ({{intToRoman()}} +
{{intToLetter()}}) which are still based on the original blog post cited by
Filip. I consider them rather trivial, so it is hopefully not a problem to
include them in tika.
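For readers who have not seen such helpers, the following is a rough sketch of what they typically look like. It is not the code from the blog post or from the attachment: the class name is hypothetical, the fallback to plain digits for out-of-range input is an assumption, and {{intToLetter()}} assumes the convention in which the letter is repeated after Z (AA, BB, ...) rather than spreadsheet-style counting (AA, AB, ...).

```java
/** Hypothetical stand-ins for the two number-formatting helpers. */
final class ListNumberFormatSketch {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    /** Upper-case Roman numeral for 1..3999; other values fall back to digits. */
    static String intToRoman(int n) {
        if (n < 1 || n > 3999) {
            return Integer.toString(n);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                sb.append(SYMBOLS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();
    }

    /** A..Z, then AA, BB, ... (repeated-letter convention, assumed here). */
    static String intToLetter(int n) {
        if (n < 1) {
            return Integer.toString(n);
        }
        char letter = (char) ('A' + (n - 1) % 26);
        int repeats = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder(repeats);
        for (int i = 0; i < repeats; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }
}
```

Lower-case variants can be derived by calling {{toLowerCase()}} on the result.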
The code is an attempt to fully implement the algorithm outlined in MS-DOC,
v20140721,
[2.4.6.3|http://msdn.microsoft.com/en-us/library/dd921056%28v=office.12%29.aspx]
+
[2.4.6.4|http://msdn.microsoft.com/en-us/library/dd945275%28v=office.12%29.aspx].
The downside of my approach is that it, IMHO, externalizes quite a bit of
functionality which should actually be inside POI. Since those
ListLevelOverrides can also influence the overall formatting of the paragraph
(something which is handled by POI), this can lead to inconsistent behaviour.
The current testcase ({{WordParserTest.java}}) has rather poor coverage of
the proposed new algorithm. I have a better test file here which reaches about
80% coverage (the rest being mostly error-handling code). Give me a shout if
you want that to be included in Tika as well.
Make sure to apply [this
patch|https://issues.apache.org/bugzilla/show_bug.cgi?id=56998] to POI before
using this.
> Basic list support in WordExtractor
> -----------------------------------
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Filip Bednárik
> Priority: Minor
> Fix For: 1.7
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch,
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other
> way of contacting you and I don't quite understand how you manage forks and
> pull requests (I don't think you do that). Plus I don't know your coding
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc
> documents, but TIKA doesn't support it. So I looked for solution and found
> one here:
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
> . So I adapted this solution to Apache TIKA with few fixes and improvements.
> Anyway feel free to use any of it so it can help people who struggle with
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils