[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380453#comment-14380453
 ] 

Steve Gullion edited comment on TIKA-1440 at 3/25/15 6:38 PM:
--------------------------------------------------------------

This was cut and pasted from Word (spaces added for indentation):
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
     a.    This is subparagraph 2(a).
     b.    This is subparagraph 2(b).
3.      This is the third paragraph.
---------------------------------------

This is the Tika output:
---------------------------------------
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
--------------------------------------

Expected output:
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
a.      This is subparagraph 2(a).
b.      This is subparagraph 2(b).
3.      This is the third paragraph.
--------------------------------------

(In a perfect world it would also include the tabs, but that's a different 
issue.)


was (Author: gullbyrd):
This was cut and pasted from Word (spaces added for indentation):
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
     a.    This is subparagraph 2(a).
    b.  This is subparagraph 2(b).
3.      This is the third paragraph.
---------------------------------------

This is the Tika output:
---------------------------------------
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
--------------------------------------

Expected output:
--------------------------------------
1.      This is the first paragraph
2.      This is the second paragraph
a.      This is subparagraph 2(a).
b.      This is subparagraph 2(b).
3.      This is the third paragraph.

(In a perfect world it would also include the tabs, but that's a different 
issue.)

> Auto-Paragraph numbers not extracted from Word Document 
> --------------------------------------------------------
>
>                 Key: TIKA-1440
>                 URL: https://issues.apache.org/jira/browse/TIKA-1440
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: Windows 7, Windows Server 2008, Tomcat
>            Reporter: Steve Gullion
>            Priority: Minor
>              Labels: numbering, paragraph, word
>
> When the text is extracted from a Microsoft Word document that uses automatic 
> numbering, the text of the automatic numbers is not extracted. As the numbers 
> can be critical to the meaning of the document (as in the case of 
> cross-references), they should be calculated and extracted if at all possible.
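
As a rough illustration of what "calculated" could mean here, the sketch below derives list labels from each paragraph's nesting level. It is not Tika's or POI's implementation: the class name ListNumberingSketch, the methods label and numberParagraphs, the limit of nine levels, and the fixed "decimal at level 0, lowercase letter below" formatting are all assumptions made for illustration. A real Word document defines its own number format per level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: compute auto-number labels from nesting levels.
// None of these names come from Tika or POI.
public class ListNumberingSketch {

    // Assumed upper bound on nesting depth for this sketch.
    private static final int MAX_LEVELS = 9;

    /** Formats a 1-based counter: decimal at level 0, lowercase letter otherwise. */
    static String label(int level, int counter) {
        if (level == 0) {
            return counter + ".";
        }
        // Simplification: wraps around after 'z' instead of producing "aa".
        return (char) ('a' + (counter - 1) % 26) + ".";
    }

    /** Prefixes each paragraph with the label implied by its nesting level. */
    static List<String> numberParagraphs(int[] levels, String[] texts) {
        int[] counters = new int[MAX_LEVELS];
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            int level = levels[i];
            counters[level]++;
            // A new item at this level restarts numbering at all deeper levels.
            for (int deeper = level + 1; deeper < MAX_LEVELS; deeper++) {
                counters[deeper] = 0;
            }
            out.add(label(level, counters[level]) + "      " + texts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] levels = {0, 0, 1, 1, 0};
        String[] texts = {
            "This is the first paragraph",
            "This is the second paragraph",
            "This is subparagraph 2(a).",
            "This is subparagraph 2(b).",
            "This is the third paragraph."
        };
        for (String line : numberParagraphs(levels, texts)) {
            System.out.println(line);
        }
    }
}
```

Run against the five paragraphs from the comment above, main prints the lines given under "Expected output", with six spaces standing in for the tab after each label.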


