[
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560750#comment-14560750
]
Moritz Dorka commented on TIKA-1315:
------------------------------------
bq. For test 2, how did you get 1.b.III?
It's been quite a while since I authored that file. At first glance, though,
I suppose this happens because the restartLim is not set to the first (the
ordinary case) but to the second most-significant ilvl. This means the third
level is only reset when an item belonging to the first level occurs. Since
"1.b", which precedes the element in question, belongs to the second level, no
such reset happens and the "II" from "1.a.II" gets incremented instead.
See the definition of
[ilvlRestartLim|https://msdn.microsoft.com/en-us/library/dd923594%28v=office.12%29.aspx]
for a more complicated explanation.
I do not know Word's XML format, but provided the logic hasn't changed, the
above would mean that you should somewhere encounter an ilvlRestartLim of 1
(not 3!) associated with the currently applicable lvl (which may come from an
override).
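To make the effect concrete, here is a minimal sketch of the restart rule as I
read it. It is a simplified, hypothetical model and not Tika's or POI's actual
code: the class ListRestartSketch, the method number() and the restartLim
array are names invented for this illustration. The assumption is that the
counter of a level is reset only by an item whose zero-based level is smaller
than that level's restart limit, and that the number formats are decimal,
lowercase letter and uppercase Roman.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of list numbering; not Tika's or POI's code.
class ListRestartSketch {

    // itemLevels: zero-based level of each list paragraph, in document order.
    // restartLim[l]: assumed restart limit of level l. The counter of level l
    // is reset only by an item whose level is smaller than restartLim[l].
    static List<String> number(int[] itemLevels, int[] restartLim) {
        int[] counters = new int[restartLim.length];
        List<String> labels = new ArrayList<>();
        for (int level : itemLevels) {
            // Reset a deeper level only if this item is more significant
            // than that level's restart limit.
            for (int deeper = level + 1; deeper < counters.length; deeper++) {
                if (level < restartLim[deeper]) {
                    counters[deeper] = 0;
                }
            }
            counters[level]++;
            labels.add(format(counters, level));
        }
        return labels;
    }

    // Assumed formats: level 0 decimal, level 1 lowercase letter, deeper
    // levels uppercase Roman. Assumes every parent level was visited first.
    static String format(int[] counters, int level) {
        StringBuilder sb = new StringBuilder();
        for (int l = 0; l <= level; l++) {
            if (l > 0) {
                sb.append('.');
            }
            int n = counters[l];
            if (l == 0) {
                sb.append(n);
            } else if (l == 1) {
                sb.append((char) ('a' + n - 1));
            } else {
                sb.append(roman(n));
            }
        }
        return sb.toString();
    }

    // Small Roman numeral helper, sufficient for counters below 40.
    static String roman(int n) {
        String[] symbols = {"X", "IX", "V", "IV", "I"};
        int[] values = {10, 9, 5, 4, 1};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] items = {0, 1, 2, 2, 1, 2};
        // Ordinary case: each level restarts after any more significant level.
        System.out.println(number(items, new int[] {0, 1, 2}));
        // Third level with a restart limit of 1: only level 0 resets it.
        System.out.println(number(items, new int[] {0, 1, 1}));
    }
}
```

With the ordinary limits the last item comes out as "1.b.I". With a restart
limit of 1 on the third level, the intervening "1.b" does not reset the
counter, so the last item becomes "1.b.III".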
> Basic list support in WordExtractor
> -----------------------------------
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Filip Bednárik
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch,
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch,
> complex_list_test.doc
>
>
> Hello guys, I am really sorry to post an issue like this, but I have no other
> way of contacting you and I don't quite understand how you manage forks and
> pull requests (I don't think you do that). Plus, I don't know your coding
> style.
> In my project I needed Tika to parse numbered lists from Word .doc
> documents, but Tika doesn't support that. So I looked for a solution and
> found one here:
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
> . So I adapted this solution to Apache Tika with a few fixes and improvements.
> Anyway, feel free to use any of it so it can help people who struggle with
> lists in Tika like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils