[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

2015-03-25 Thread Steve Gullion (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380453#comment-14380453
 ] 

Steve Gullion commented on TIKA-1440:
-

This was cut and pasted from Word (spaces added for indentation):
---
1.  This is the first paragraph
2.  This is the second paragraph
    a.  This is subparagraph 2(a).
    b.  This is subparagraph 2(b).
3.  This is the third paragraph.
---

This is the Tika output:
---
This is the first paragraph
This is the second paragraph
This is subparagraph 2(a).
This is subparagraph 2(b).
This is the third paragraph.
---

Expected output:
---
1.  This is the first paragraph
2.  This is the second paragraph
    a.  This is subparagraph 2(a).
    b.  This is subparagraph 2(b).
3.  This is the third paragraph.
---

(In a perfect world it would also include the tabs, but that's a different 
issue.)
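
To make the expected output concrete, here is a minimal sketch of how the labels could be calculated once each paragraph's list level is known. It is an illustration only, not Tika or POI code: the class name AutoNumberSketch, the methods label and number, and the fixed formats (decimal at the top level, lowercase letters one level down) are all assumptions. A real document defines its own format per level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from Tika or POI.
public class AutoNumberSketch {

    // Assumed formats: lowercase letters at level 1, decimals everywhere else.
    // The letter format only covers a..z and wraps around after that.
    static String label(int level, int count) {
        if (level == 1) {
            return (char) ('a' + (count - 1) % 26) + ".";
        }
        return count + ".";
    }

    // levels[i] is the list level of paragraph i (0 = top level),
    // texts[i] is the paragraph text without any number.
    static List<String> number(int[] levels, String[] texts) {
        int[] counters = new int[9]; // assumes at most nine list levels
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            int level = levels[i];
            counters[level]++;
            // A new item at this level restarts all deeper levels.
            for (int d = level + 1; d < counters.length; d++) {
                counters[d] = 0;
            }
            out.add(label(level, counters[level]) + "  " + texts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] levels = {0, 0, 1, 1, 0};
        String[] texts = {
            "This is the first paragraph",
            "This is the second paragraph",
            "This is subparagraph 2(a).",
            "This is subparagraph 2(b).",
            "This is the third paragraph."
        };
        for (String line : number(levels, texts)) {
            System.out.println(line);
        }
    }
}
```

Run against the five sample paragraphs, this prints the labels 1., 2., a., b., 3. in that order. Indentation is left out on purpose, in line with the note above that tabs are a separate issue.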

 Auto-Paragraph numbers not extracted from Word Document 
 

 Key: TIKA-1440
 URL: https://issues.apache.org/jira/browse/TIKA-1440
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: Windows 7, Windows Server 2008, Tomcat
Reporter: Steve Gullion
Priority: Minor
  Labels: numbering, paragraph, word

 When the text is extracted from a Microsoft Word document that uses automatic 
 numbering, the text of the automatic numbers is not extracted. As the numbers 
 can be critical to the meaning of the document (as in the case of 
 cross-references), they should be calculated and extracted if at all possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

2015-03-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380393#comment-14380393
 ] 

Tim Allison commented on TIKA-1440:
---

Are you able to post a mock-up document and the expected output?  I can't tell 
yet whether we can do this at the Tika level or whether we'll need mods to POI.



[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

2015-03-25 Thread Steve Gullion (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380466#comment-14380466
 ] 

Steve Gullion commented on TIKA-1440:
-

Also, I just confirmed that this applies to both .doc and .docx.



[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

2015-03-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380496#comment-14380496
 ] 

Tim Allison commented on TIKA-1440:
---

Apologies if this message crosses your attachments in the ether, but would you 
also be able to attach both .doc and .docx examples?

 Auto-Paragraph numbers not extracted from Word Document 
 

 Key: TIKA-1440
 URL: https://issues.apache.org/jira/browse/TIKA-1440
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: Windows 7, Windows Server 2008, Tomcat
Reporter: Steve Gullion
Priority: Minor
  Labels: numbering, paragraph, word
 Attachments: Tika Test.docx, Tika test 2003.doc




[jira] [Commented] (TIKA-1440) Auto-Paragraph numbers not extracted from Word Document

2015-03-25 Thread Steve Gullion (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380500#comment-14380500
 ] 

Steve Gullion commented on TIKA-1440:
-

Never mind, duh, I found the attach files menu item.


