[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document
[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380453#comment-14380453 ]

Steve Gullion commented on TIKA-1440:
-------------------------------------

This was cut and pasted from Word (spaces added for indentation):
----------
1. This is the first paragraph
2. This is the second paragraph
    a. This is subparagraph 2(a).
    b. This is subparagraph 2(b).
3. This is the third paragraph.
----------
This is the Tika output:
----------
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
----------
Expected output:
----------
1. This is the first paragraph
2. This is the second paragraph
a. This is subparagraph 2(a).
b. This is subparagraph 2(b).
3. This is the third paragraph.
----------
(In a perfect world it would also include the tabs, but that's a different issue.)

Auto-Paragraph numbers not extracted from Word Document
-------------------------------------------------------

Key: TIKA-1440
URL: https://issues.apache.org/jira/browse/TIKA-1440
Project: Tika
Issue Type: Bug
Components: parser
Environment: Windows 7, Windows Server 2008, Tomcat
Reporter: Steve Gullion
Priority: Minor
Labels: numbering, paragraph, word

When the text is extracted from a Microsoft Word document that uses automatic numbering, the text of the automatic numbers is not extracted. As the numbers can be critical to the meaning of the document (as in the case of cross-references), they should be calculated and extracted if at all possible.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
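To make the request that the numbers "be calculated" concrete, here is a minimal sketch of the counting involved for the two-level sample in the comment. It is not Tika or POI code: `ListNumberer`, `nextLabel`, and `render` are hypothetical names, and the nesting levels are supplied by hand. A real fix would have to read the levels, number formats, and start values from the Word file's numbering definitions, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika or POI): rebuilds list labels from nesting levels. */
final class ListNumberer {
    // One counter per nesting level. Assumption: level 0 renders as 1., 2., ...
    // and level 1 renders as a., b., ... as in the sample document.
    private final int[] counters = new int[2];

    /** Returns the label for the next list paragraph at the given level (0 or 1). */
    String nextLabel(int level) {
        counters[level]++;
        // A new item at an outer level restarts every deeper level.
        for (int i = level + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        if (level == 0) {
            return counters[0] + ".";
        }
        // Assumes fewer than 27 items in a lettered sublist.
        return (char) ('a' + counters[level] - 1) + ".";
    }

    /** Prefixes each paragraph with its computed label and a four-space indent per level. */
    static List<String> render(int[] levels, String[] texts) {
        ListNumberer numberer = new ListNumberer();
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            String indent = levels[i] == 0 ? "" : "    ";
            lines.add(indent + numberer.nextLabel(levels[i]) + " " + texts[i]);
        }
        return lines;
    }
}
```

Calling `ListNumberer.render` with levels `{0, 0, 1, 1, 0}` and the five paragraph texts from the sample yields lines labeled `1.`, `2.`, `a.`, `b.`, and `3.`, which matches the expected output apart from the indentation.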
[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document
[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380393#comment-14380393 ]

Tim Allison commented on TIKA-1440:
-----------------------------------

Able to post a mock-up document and expected output? Can't tell if we'll be able to do this at the Tika level or if we'll need mods to POI.
[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document
[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380466#comment-14380466 ]

Steve Gullion commented on TIKA-1440:
-------------------------------------

Also, I just confirmed that this applies to both .doc and .docx.
[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document
[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380496#comment-14380496 ]

Tim Allison commented on TIKA-1440:
-----------------------------------

Apologies if this message crosses your attachments in the ether, but would you also be able to attach both .doc and .docx examples?

Key: TIKA-1440
Attachments: Tika Test.docx, Tika test 2003.doc
[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document
[ https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380500#comment-14380500 ]

Steve Gullion commented on TIKA-1440:
-------------------------------------

Never mind, duh, I found the attach files menu item.