[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380453#comment-14380453 ]
Steve Gullion commented on TIKA-1440:
-------------------------------------
This was cut and pasted from Word (spaces added for indentation):
--------------------------------------
1. This is the first paragraph
2. This is the second paragraph
    a. This is subparagraph 2(a).
    b. This is subparagraph 2(b).
3. This is the third paragraph.
---------------------------------------
This is the Tika output:
---------------------------------------
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
--------------------------------------
Expected output:
--------------------------------------
1. This is the first paragraph
2. This is the second paragraph
    a. This is subparagraph 2(a).
    b. This is subparagraph 2(b).
3. This is the third paragraph.
(In a perfect world it would also include the tabs, but that's a different
issue.)
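
For illustration only, the calculation being asked for could look roughly like the sketch below. It is written in Python for brevity and is not Tika or POI code: the Paragraph class, number_paragraphs, and the per-level formats (decimal at the top level, lower-case letters below it) are all assumptions made for this example. A real implementation would have to read each paragraph's list level and number format from the document's own numbering definitions.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    # Hypothetical stand-in for a list paragraph: 0 = top level, 1 = sub-item, ...
    level: int
    text: str


def to_lower_letter(n: int) -> str:
    """Map 1 -> a, 26 -> z, 27 -> aa (spreadsheet-style letters).

    Assumption: this scheme is only meant to cover short lists; Word's own
    lettering past 'z' may differ.
    """
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return letters


# Assumed number format per level; levels deeper than 1 reuse the last entry.
FORMATTERS = [str, to_lower_letter]


def number_paragraphs(paragraphs, indent="    "):
    """Return the paragraphs as text lines with computed list labels."""
    counters = []  # counters[i] is the current item number at level i
    lines = []
    for p in paragraphs:
        # Forget deeper levels so that a later sub-list restarts at 1.
        del counters[p.level + 1:]
        # Make sure a counter exists for this level.
        while len(counters) <= p.level:
            counters.append(0)
        counters[p.level] += 1
        fmt = FORMATTERS[min(p.level, len(FORMATTERS) - 1)]
        label = fmt(counters[p.level])
        lines.append(f"{indent * p.level}{label}. {p.text}")
    return lines


sample = [
    Paragraph(0, "This is the first paragraph"),
    Paragraph(0, "This is the second paragraph"),
    Paragraph(1, "This is subparagraph 2(a)."),
    Paragraph(1, "This is subparagraph 2(b)."),
    Paragraph(0, "This is the third paragraph."),
]
for line in number_paragraphs(sample):
    print(line)
```

Run on the five sample paragraphs, this prints the numbered and lettered lines shown under "Expected output", with the two subparagraphs indented by four spaces.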
> Auto-Paragraph numbers not extracted from Word Document
> --------------------------------------------------------
>
> Key: TIKA-1440
> URL: https://issues.apache.org/jira/browse/TIKA-1440
> Project: Tika
> Issue Type: Bug
> Components: parser
> Environment: Windows 7, Windows Server 2008, Tomcat
> Reporter: Steve Gullion
> Priority: Minor
> Labels: numbering, paragraph, word
>
> When the text is extracted from a Microsoft Word document that uses automatic
> numbering, the text of the automatic numbers is not extracted. As the numbers
> can be critical to the meaning of the document (as in the case of
> cross-references), they should be calculated and extracted if at all possible.