[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

Steve Gullion (JIRA) Wed, 25 Mar 2015 11:37:41 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380453#comment-14380453
 ]


Steve Gullion commented on TIKA-1440:
-------------------------------------

The following was copied and pasted from Word (spaces added for indentation):
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
    a.  This is subparagraph 2(a).
    b.  This is subparagraph 2(b).
3.      This is the third paragraph.
---------------------------------------

This is the Tika output:
---------------------------------------
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
--------------------------------------

Expected output:
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
a.      This is subparagraph 2(a).
b.      This is subparagraph 2(b).
3.      This is the third paragraph.
--------------------------------------

(In a perfect world it would also include the tabs, but that's a different 
issue.)
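
To illustrate how the expected labels could be calculated, here is a minimal sketch. It is not Tika's or POI's actual code. It assumes each paragraph's list level is already known (0 for top level, 1 for a subparagraph), that even levels use decimal numbers and odd levels use lowercase letters, and that every list starts at 1. Real Word documents define formats, start values and overrides in their numbering definitions, which this ignores. The class and method names (ListNumberSketch, labels, format) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListNumberSketch {

    /** Returns one label per paragraph, given each paragraph's list level (0 = top). */
    static List<String> labels(int[] levels) {
        int[] counters = new int[9]; // Word lists have at most nine levels (0..8)
        List<String> out = new ArrayList<>();
        for (int level : levels) {
            counters[level]++;
            // A new item at this level restarts the numbering of all deeper levels.
            Arrays.fill(counters, level + 1, counters.length, 0);
            out.add(format(level, counters[level]) + ".");
        }
        return out;
    }

    /** Assumed formats: decimal on even levels, lowercase letters on odd levels. */
    static String format(int level, int n) {
        if (level % 2 == 0) {
            return Integer.toString(n);
        }
        return String.valueOf((char) ('a' + (n - 1) % 26));
    }

    public static void main(String[] args) {
        // Levels of the five paragraphs in the example above.
        int[] levels = {0, 0, 1, 1, 0};
        String[] text = {
            "This is the first paragraph",
            "This is the second paragraph",
            "This is subparagraph 2(a).",
            "This is subparagraph 2(b).",
            "This is the third paragraph."
        };
        List<String> computed = labels(levels);
        for (int i = 0; i < text.length; i++) {
            System.out.println(computed.get(i) + "\t" + text[i]);
        }
    }
}
```

The only state needed is one counter per level, which is why the labels can be rebuilt during extraction even though they are not stored as text in the document.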

> Auto-Paragraph numbers not extracted from Word Document 
> --------------------------------------------------------
>
>                 Key: TIKA-1440
>                 URL: https://issues.apache.org/jira/browse/TIKA-1440
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: Windows 7, Windows Server 2008, Tomcat
>            Reporter: Steve Gullion
>            Priority: Minor
>              Labels: numbering, paragraph, word
>
> When the text is extracted from a Microsoft Word document that uses automatic 
> numbering, the text of the automatic numbers is not extracted. As the numbers 
> can be critical to the meaning of the document (as in the case of 
> cross-references), they should be calculated and extracted if at all possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
