[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2017-09-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153920#comment-16153920
 ] 

Tim Allison commented on TIKA-1194:
---

This _may_ be related to: https://bz.apache.org/bugzilla/show_bug.cgi?id=61490

> Missing text from MS Word (DOC) file
> 
>
> Key: TIKA-1194
> URL: https://issues.apache.org/jira/browse/TIKA-1194
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Tomas Safarik
> Priority: Critical
> Attachments: apache-tika-1.5.patch, OP-06-015.doc
>
>
> Hello,
> we noticed that the filtered text from some MS Word DOC files is missing one
> line (in a table cell) that is present in the original document.
> - If you add or remove one character anywhere before the problematic
> line/cell, the filtered text is correct. If you restore the original text,
> the filtering problem comes back.
> - If the file is resaved as DOCX, filtering works fine.
> I will provide a sample document. Please let me know if more information is
> needed.
> Regards,
> Tomas



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2017-01-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828279#comment-15828279
 ] 

Tim Allison commented on TIKA-1194:
---

[~tssk], I don't think the attached patch will help with the attached .doc 
file. The triggering file is handled by the regular HWPFDocument, not by 
HWPFOldDocument.

The problem seems to be in the calculation of the number of cells in that 
particular row in the table.

I'm able to see the text if I iterate through all paragraphs (and ignore table 
info) or if I call {{.text()}} on the table.
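
The cell-count hypothesis can be illustrated with a toy model. The following 
is a minimal Python sketch, not POI's implementation: the function names and 
the `declared_cells` parameter are hypothetical, and the only format detail 
assumed is that a paragraph closing a table cell ends with the cell mark 
U+0007. It shows how a row whose declared cell count is too small loses its 
trailing cell when text is collected cell by cell, while plain paragraph 
iteration still sees everything.

```python
CELL_MARK = "\u0007"  # character that ends a table cell in the binary DOC text stream


def text_by_cells(paragraphs, declared_cells):
    """Group a row's paragraphs into cells, then trust the declared cell count."""
    cells, current = [], []
    for para in paragraphs:
        current.append(para.rstrip(CELL_MARK))
        if para.endswith(CELL_MARK):
            # A cell mark closes the current cell.
            cells.append("\n".join(current))
            current = []
    # If declared_cells is too small, the trailing cells are silently dropped.
    return cells[:declared_cells]


def text_by_paragraphs(paragraphs):
    """Ignore table structure entirely and return every paragraph's text."""
    return [para.rstrip(CELL_MARK) for para in paragraphs]


def dropped_text(paragraphs, declared_cells):
    """Paragraphs visible to paragraph iteration but lost by cell traversal."""
    seen = set("\n".join(text_by_cells(paragraphs, declared_cells)).split("\n"))
    return [p for p in text_by_paragraphs(paragraphs) if p not in seen]


# Hypothetical row with three cells; the middle cell holds two paragraphs.
row = ["Name" + CELL_MARK, "Address line 1", "Address line 2" + CELL_MARK, "Phone" + CELL_MARK]
print(dropped_text(row, declared_cells=2))
print(dropped_text(row, declared_cells=3))
```

With a declared count of 2 the last cell is reported as dropped; with the 
correct count of 3 nothing is lost. Whether POI's row handling fails in this 
way for the attached file is an assumption that would need to be confirmed 
in a debugger.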



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2016-09-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493888#comment-15493888
 ] 

Tim Allison commented on TIKA-1194:
---

Yes, this is still failing.



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-03-20 Thread Tomas Safarik (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371960#comment-14371960
 ] 

Tomas Safarik commented on TIKA-1194:
-

Sorry, but no.

1) I don't have the source code. I just got this from our developer when I 
asked what he changed to make it work.

2) I think this is just a workaround and not a proper solution. I think the bug 
should be filed against Apache POI.

Tomas



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371939#comment-14371939
 ] 

Tyler Palsulich commented on TIKA-1194:
---

Thank you, [~tssk]! Is there any way you can create a patch from {{svn diff}}, 
instead of (I think) just regular {{diff}}? Then, we can hopefully integrate 
this into trunk. :)



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330021#comment-14330021
 ] 

Tyler Palsulich commented on TIKA-1194:
---

[~tssk], were you ever able to create a safe version of the file? Do you 
still have it? It's been a while since this issue was opened.



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2013-11-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820013#comment-13820013
 ] 

Nick Burch commented on TIKA-1194:
--

I've had a quick look, and WordExtractor from Apache POI skips the text too.

My first hunch would be that it's something to do with text fields.

Any chance you could step through the parser in a debugger, checking the text 
of the ranges around the point of the missing text, and see if there's anything 
odd going on?
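
To know which ranges to inspect, it helps to pin down exactly where the text 
goes missing. The following is a minimal Python sketch, assuming a reference 
extraction is available as a list of lines (for example from the DOCX resave, 
which the report says extracts correctly). The function name 
`missing_with_context` and the sample data are hypothetical; only the 
standard-library `difflib` module is used.

```python
import difflib


def missing_with_context(reference_lines, extracted_lines, context=1):
    """List reference lines absent from the extraction, with their neighbours."""
    matcher = difflib.SequenceMatcher(
        a=reference_lines, b=extracted_lines, autojunk=False
    )
    results = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean reference lines i1..i2 did not survive.
        if tag in ("delete", "replace"):
            results.append({
                "missing": reference_lines[i1:i2],
                "before": reference_lines[max(0, i1 - context):i1],
                "after": reference_lines[i2:i2 + context],
            })
    return results


# Hypothetical data: the DOC extraction lacks one table-cell line.
reference = ["Header", "Cell A", "Cell B", "Footer"]
extracted = ["Header", "Cell A", "Footer"]
for gap in missing_with_context(reference, extracted):
    print(gap)
```

The "before" and "after" lines give search strings for locating the 
surrounding ranges in a debugger session.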



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2013-11-12 Thread Tomas Safarik (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820080#comment-13820080
 ] 

Tomas Safarik commented on TIKA-1194:
-

Sorry, I had to remove the document because it contained some personal 
information. I will upload a safe one as soon as possible.



[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2013-11-12 Thread Tomas Safarik (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820139#comment-13820139
 ] 

Tomas Safarik commented on TIKA-1194:
-

The text is also missing from the Apache POI WordToTextConverter output, but I 
can see it intact in one variable in the debugger. Should I move this to the 
Apache POI bug tracker?
