[
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380592#comment-14380592
]
Tim Allison commented on TIKA-1440:
-----------------------------------
For doc, these two look useful:
[bugzilla|https://bz.apache.org/bugzilla/show_bug.cgi?id=49850] and
[blog|http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/]
For docx, I think we can just do the calculations from:
{noformat}
paragraph.getNumFmt()
paragraph.getNumIlvl()
{noformat}
I think I remember reviewing a patch on this for ppt or pptx a good while ago.
I'll see if we can reuse any code from there.
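To make the docx idea concrete, here is a rough sketch of the counting step, not a patch. It assumes we have already read the format string and level from each paragraph (as far as I recall, getNumFmt() returns a String such as "decimal" or "lowerLetter", and getNumIlvl() returns a BigInteger that is null for non-list paragraphs). The class and method names (ListNumberSketch, next, format) are hypothetical. The sketch deliberately ignores numId (separate lists), the level text pattern (e.g. "%1.%2."), and start overrides, all of which a real implementation would need. It also assumes Word repeats the letter after "z" (aa, bb, ...).

```java
final class ListNumberSketch {
    // One running counter per list level; Word allows nine levels (0..8).
    private final int[] counters = new int[9];

    /** Returns the label for the next paragraph at level ilvl (hypothetical helper). */
    String next(String numFmt, int ilvl) {
        counters[ilvl]++;
        // A new item at this level restarts every deeper level.
        for (int i = ilvl + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        return format(numFmt, counters[ilvl]);
    }

    /** Formats n for a numFmt value; unknown formats fall back to decimal. */
    static String format(String numFmt, int n) {
        switch (numFmt) {
            case "lowerLetter": return toLetters(n) + ".";
            case "upperLetter": return toLetters(n).toUpperCase() + ".";
            case "lowerRoman":  return toRoman(n).toLowerCase() + ".";
            case "upperRoman":  return toRoman(n) + ".";
            case "bullet":      return "\u2022";
            default:            return n + ".";
        }
    }

    // Assumption: after "z" the letter is repeated (aa, bb, ...), not "ab".
    private static String toLetters(int n) {
        char c = (char) ('a' + (n - 1) % 26);
        int repeats = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Standard subtractive Roman numerals, upper case.
    private static String toRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }
}
```

Calling next("decimal", 0), then next("lowerLetter", 1) twice, then next("decimal", 0) again would give "1.", "a.", "b.", "2.", and the next level-1 paragraph starts over at "a.". Paragraphs whose getNumIlvl() is null would simply be skipped.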
> Auto-Paragraph numbers not extracted from Word Document
> --------------------------------------------------------
>
> Key: TIKA-1440
> URL: https://issues.apache.org/jira/browse/TIKA-1440
> Project: Tika
> Issue Type: Bug
> Components: parser
> Environment: Windows 7, Windows Server 2008, Tomcat
> Reporter: Steve Gullion
> Priority: Minor
> Labels: numbering, paragraph, word
> Attachments: Tika Test.docx, Tika test 2003.doc
>
>
> When the text is extracted from a Microsoft Word document that uses automatic
> numbering, the text of the automatic numbers is not extracted. As the numbers
> can be critical to the meaning of the document (as in the case of
> cross-references), they should be calculated and extracted if at all possible.