[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

Tim Allison (JIRA) Wed, 25 Mar 2015 12:33:07 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380592#comment-14380592 ]


Tim Allison commented on TIKA-1440:
-----------------------------------

For doc, these two look useful:
[bugzilla|https://bz.apache.org/bugzilla/show_bug.cgi?id=49850] and
[blog|http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/]

For docx, I think we can just calculate the numbers from:
{noformat}
paragraph.getNumFmt() 
paragraph.getNumIlvl()
{noformat}
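
As a rough illustration of that calculation, here is a minimal sketch that keeps one counter per indent level and renders a label from the format name. It is not Tika or POI code: the class name ListNumberSketch and its methods are hypothetical, and the numFmt and ilvl arguments only stand in for the values that getNumFmt() and getNumIlvl() would supply (with the level converted to an int). It assumes a single list, a start value of 1 at every level, and a plain "." suffix, so it ignores the numbering ID, start overrides, and level text templates that a real docx can define. Only a few format names ("decimal", "lowerLetter", "upperLetter", "lowerRoman", "upperRoman") are handled; anything else, including bullets, falls back to decimal.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tracks per-level counters for ONE list and renders labels. */
class ListNumberSketch {
    // One counter per indent level (ilvl); index 0 is the outermost level.
    private final List<Integer> counters = new ArrayList<>();

    /**
     * Advances the counter at the given level and returns its label.
     * numFmt and ilvl stand in for paragraph.getNumFmt() / paragraph.getNumIlvl().
     * Assumption: every level starts at 1 and the label suffix is always ".".
     */
    String next(String numFmt, int ilvl) {
        // Grow the counter list so that ilvl is a valid index.
        while (counters.size() <= ilvl) {
            counters.add(0);
        }
        // Returning to a shallower level restarts every deeper level.
        while (counters.size() > ilvl + 1) {
            counters.remove(counters.size() - 1);
        }
        int n = counters.get(ilvl) + 1;
        counters.set(ilvl, n);
        return format(numFmt, n) + ".";
    }

    /** Renders n in the given format; unknown or null formats fall back to decimal. */
    static String format(String numFmt, int n) {
        if (numFmt == null) {
            return Integer.toString(n);
        }
        switch (numFmt) {
            case "lowerLetter": return letters(n);
            case "upperLetter": return letters(n).toUpperCase();
            case "lowerRoman":  return roman(n).toLowerCase();
            case "upperRoman":  return roman(n);
            default:            return Integer.toString(n);
        }
    }

    /** a..z, then the letter is repeated: aa, bb, cc, ... */
    static String letters(int n) {
        char c = (char) ('a' + (n - 1) % 26);
        int repeat = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Standard subtractive Roman numerals in upper case. */
    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ListNumberSketch list = new ListNumberSketch();
        System.out.println(list.next("decimal", 0));     // first top-level item
        System.out.println(list.next("lowerLetter", 1)); // first nested item
        System.out.println(list.next("decimal", 0));     // second top-level item
    }
}
```

A real implementation would presumably need one such counter set per numbering ID, so that separate lists in the same document do not share counts.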

I think I remember reviewing a patch on this for ppt or pptx a good while ago.  
I'll see if we can reuse any code from there.

> Auto-Paragraph numbers not extracted from Word Document 
> --------------------------------------------------------
>
>                 Key: TIKA-1440
>                 URL: https://issues.apache.org/jira/browse/TIKA-1440
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: Windows 7, Windows Server 2008, Tomcat
>            Reporter: Steve Gullion
>            Priority: Minor
>              Labels: numbering, paragraph, word
>         Attachments: Tika Test.docx, Tika test 2003.doc
>
>
> When the text is extracted from a Microsoft Word document that uses automatic 
> numbering, the text of the automatic numbers is not extracted. As the numbers 
> can be critical to the meaning of the document (as in the case of 
> cross-references), they should be calculated and extracted if at all possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
