[jira] [Comment Edited] (TIKA-1315) Basic list support in WordExtractor

Moritz Dorka (JIRA) Mon, 22 Sep 2014 01:15:59 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495
 ]


Moritz Dorka edited comment on TIKA-1315 at 9/22/14 8:14 AM:
-------------------------------------------------------------

Hmm, apparently, files are global to a bug in Jira and are not linked to 
specific comments... Too bad. So this is related to [^ListManager.tar.bz2] and 
[^ListNumbering.patch] which I propose as substitutes for Filip's work.

----
\\
The original patch proposed by Filip is quite good, but
* it lacks true support for ListFormatOverrideLevels (which, admittedly, is a 
really brain-twisting feature of Word),
* it does not cope correctly with bullets / unnumbered items (i.e. levels which 
have 0x17 or 0xFF as their nfc) on arbitrary levels of multilevel lists,
* there is no support for legal formatting, and
* there is no support for levels which restart at arbitrary more-significant 
levels.
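
To make the points about bullets, legal formatting and restarts concrete, here 
is a minimal sketch of a multilevel counter. It is not the code from 
ListNumbering.patch and it ignores ListFormatOverrideLevels entirely. The class 
name {{MultiLevelCounter}}, the {{restartLim}} parameter (loosely modelled on 
the restart limit described in MS-DOC, whose exact semantics may differ), the 
assumption that every level starts at 1, and the plain dot-joined label are all 
hypothetical simplifications.

```java
// Illustrative sketch only; a simplification, not the attached implementation.
final class MultiLevelCounter {
    static final int NFC_ARABIC = 0x00;        // 1, 2, 3, ...
    static final int NFC_LOWER_LETTER = 0x04;  // a, b, c, ...
    static final int NFC_BULLET = 0x17;        // bullet, carries no number
    static final int NFC_NONE = 0xFF;          // no number at all

    private final int[] nfc;         // number format code per level
    private final int[] restartLim;  // assumption: level d restarts after any level < restartLim[d]
    private final boolean[] legal;   // assumption: if set, the whole label of that level is Arabic
    private final int[] current;     // last number issued per level, 0 = none yet

    MultiLevelCounter(int[] nfc, int[] restartLim, boolean[] legal) {
        this.nfc = nfc.clone();
        this.restartLim = restartLim.clone();
        this.legal = legal.clone();
        this.current = new int[nfc.length];
    }

    /** Returns the label for the next paragraph on the given zero-based level. */
    String next(int level) {
        // Reset every deeper level that is configured to restart after this one.
        for (int d = level + 1; d < current.length; d++) {
            if (level < restartLim[d]) {
                current[d] = 0;
            }
        }
        if (!isNumbered(level)) {
            // Bullets and unnumbered items do not consume a number.
            return nfc[level] == NFC_BULLET ? "\u2022" : "";
        }
        current[level]++;
        StringBuilder sb = new StringBuilder();
        for (int l = 0; l <= level; l++) {
            if (!isNumbered(l)) {
                continue; // bullet / unnumbered ancestors contribute nothing
            }
            if (sb.length() > 0) {
                sb.append('.');
            }
            sb.append(format(l, legal[level]));
        }
        return sb.toString();
    }

    private boolean isNumbered(int l) {
        return nfc[l] != NFC_BULLET && nfc[l] != NFC_NONE;
    }

    // Assumes the level has already been counted at least once.
    private String format(int l, boolean asArabic) {
        if (asArabic || nfc[l] == NFC_ARABIC) {
            return Integer.toString(current[l]);
        }
        return String.valueOf((char) ('a' + (current[l] - 1) % 26));
    }
}
```

With levels (Arabic, lower letter, Arabic), {{restartLim = {0, 1, 1}}} and legal 
formatting on the third level only, the third level keeps counting across 
second-level items and is rendered purely in Arabic digits, while a new 
top-level item resets both deeper levels.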

Attached is an improved version of the numbering algorithm, written from 
scratch, with the exception of two helper methods ({{intToRoman()}} + 
{{intToLetter()}}) which are still based on the original blog post cited by 
Filip. I consider them rather trivial, so it is hopefully not a problem to 
include them in Tika.
The code is an attempt to fully implement the algorithm outlined in [MS-DOC], 
v20140721, sections 
[2.4.6.3|http://msdn.microsoft.com/en-us/library/dd921056%28v=office.12%29.aspx] 
and 
[2.4.6.4|http://msdn.microsoft.com/en-us/library/dd945275%28v=office.12%29.aspx].
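
For readers who have not seen such helpers, the following is a minimal sketch 
of what methods like {{intToRoman()}} and {{intToLetter()}} might look like. It 
is not the code from the attachment or from the blog post. The class name 
{{ListNumberFormat}}, the lower-case output and the supported ranges are 
assumptions. The letter scheme (a..z, then aa, bb, ...) reflects how Word is 
commonly observed to continue lettered lists and should be checked against 
real documents.

```java
// Illustrative sketch only; names and ranges are assumptions.
final class ListNumberFormat {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] NUMERALS =
            {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};

    private ListNumberFormat() {
    }

    /** Converts 1..3999 to a lower-case Roman numeral. */
    static String intToRoman(int n) {
        if (n < 1 || n > 3999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        StringBuilder sb = new StringBuilder();
        // Greedily subtract the largest value that still fits.
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                sb.append(NUMERALS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();
    }

    /** Converts 1, 2, ... to a, b, ..., z, aa, bb, ... (the letter is repeated). */
    static String intToLetter(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        char letter = (char) ('a' + (n - 1) % 26);
        int repeat = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }
}
```

Upper-case variants can be derived by calling {{toUpperCase()}} on the result.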

The downside of my approach is that, IMHO, it externalizes quite a bit of 
functionality which should actually live inside POI. Since those 
ListLevelOverrides can also influence the overall formatting of the paragraph 
(something which is handled by POI), this can lead to inconsistent behaviour.

The current test case ({{WordParserTest.java}}) has rather poor coverage of 
the proposed new algorithm. I have a better test file here which reaches about 
80% coverage (the rest being mostly error-handling code). Give me a shout if 
you want that to be included in Tika as well.

Make sure to apply [this 
patch|https://issues.apache.org/bugzilla/show_bug.cgi?id=56998] to POI before 
using this.


was (Author: morido):
Hmm, apparently files are global to a bug in Jira and are not linked to 
specific comments... Too bad. So this is related to ListManager.tar.bz2 and 
ListNumbering.patch which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good but it lacks true support 
for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting 
feature of Word), it does not cope correctly with bullets / unnumbered items 
(i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of 
multilevel lists and there is no support for either legal formatting or levels 
which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from 
scratch, with the exception of two helper methods (intToRoman() + 
intToLetter()) which are still based on the original blog post cited by Filip. 
I consider them rather trivial, so it is hopefully not a problem to include 
them in tika.
The code is an attempt to fully implement the algorithm outlined in [MS-DOC], 
v20140721, 2.4.6.3 + 2.4.6.4.

Downside of my approach is that it IMHO externalizes quite a bit of 
functionality which should actually be inside POI. Since those 
ListLevelOverrides can also influence the overall formatting of the paragraph 
(something which is handled by POI) this can lead to inconsistent behaviour.

The current testcase (WordParserTest.java) has rather bad coverage for the 
proposed new algorithm. I have a better test file here which reaches about 80% 
(the rest being mostly error handling stuff). Give me a shout if you want that 
to be included in tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to 
POI before using this.

> Basic list support in WordExtractor
> -----------------------------------
>
>                 Key: TIKA-1315
>                 URL: https://issues.apache.org/jira/browse/TIKA-1315
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Filip Bednárik
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc 
> documents, but TIKA doesn't support it. So I looked for solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache TIKA with few fixes and improvements. 
> Anyway feel free to use any of it so it can help people who struggle with 
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
