[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153920#comment-16153920 ]

Tim Allison commented on TIKA-1194:
-----------------------------------

This _may_ be related to: https://bz.apache.org/bugzilla/show_bug.cgi?id=61490

> Missing text from MS Word (DOC) file
> ------------------------------------
>
>                 Key: TIKA-1194
>                 URL: https://issues.apache.org/jira/browse/TIKA-1194
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Tomas Safarik
>            Priority: Critical
>         Attachments: apache-tika-1.5.patch, OP-06-015.doc
>
>
> Hello,
> we noticed that filtered text from some MS Word DOC files is missing one line (in table cell) in the original document.
> - If you add or remove one character anywhere before the problematic line/cell then the filtered text is correct. If you get the text back to original the filtering problem is back.
> - If the file is resaved as DOCX filtering works fine.
> I will provide sample document. And please let me know if more information is needed.
> Regards,
> Tomas

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828279#comment-15828279 ]

Tim Allison commented on TIKA-1194:
-----------------------------------

[~tssk]...With the attached .doc file, I don't think the attached patch will help. The triggering file is handled by the regular HWPFDocument, not the HWPFOldDocument. The problem seems to be in the calculation of the number of cells in that particular row of the table. I'm able to see the text if I iterate through all paragraphs (and ignore table info) or if I call {{.text()}} on the table.
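The observation above (the text shows up when iterating over all paragraphs but is lost when walking the table cell by cell) suggests one way to pin down the affected cell: extract the text both ways and list the lines that only the paragraph pass produces. What follows is a minimal sketch in plain Java, assuming both extractions are already available as strings. `MissingLineFinder` and its methods are hypothetical names, not part of Tika or POI, and obtaining the two strings from POI is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical diagnostic helper; not part of Tika or POI. */
final class MissingLineFinder {

    // Word text uses CR as the paragraph mark and U+0007 as the cell mark,
    // so both are treated as line separators here, alongside LF.
    private static final String SEPARATORS = "[\\r\\n\\u0007]+";

    /** Collapses whitespace runs and trims, so layout differences are ignored. */
    static String normalize(String line) {
        return line.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns the non-empty lines of referenceText (for example, text collected
     * paragraph by paragraph) that never occur in extractedText (for example,
     * text collected by walking table rows and cells).
     */
    static List<String> missingLines(String referenceText, String extractedText) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : extractedText.split(SEPARATORS)) {
            seen.add(normalize(line));
        }
        List<String> missing = new ArrayList<>();
        for (String line : referenceText.split(SEPARATORS)) {
            String normalized = normalize(line);
            if (!normalized.isEmpty() && !seen.contains(normalized)) {
                missing.add(normalized);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample strings, only to show the call shape.
        String byParagraph = "row one\nlost cell\nrow two";
        String byTableCells = "row one\nrow two";
        System.out.println(missingLines(byParagraph, byTableCells));
    }
}
```

The comparison is line-based and ignores order on purpose, since the two traversals may emit cells in a different sequence. It cannot tell apart two identical lines where only one of them is dropped.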
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493888#comment-15493888 ]

Tim Allison commented on TIKA-1194:
-----------------------------------

Yes, this is still failing.
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371960#comment-14371960 ]

Tomas Safarik commented on TIKA-1194:
-------------------------------------

Sorry, but no.
1) I don't have the source code. I just got this from our developer when I asked what he changed to make it work.
2) I think this is just a workaround and not a proper solution. I think a bug should be filed against Apache POI.

Tomas
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371939#comment-14371939 ]

Tyler Palsulich commented on TIKA-1194:
---------------------------------------

Thank you, [~tssk]! Is there any way you can create a patch from {{svn diff}}, instead of (I think) just regular {{diff}}? Then, we can hopefully integrate this into trunk. :)
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330021#comment-14330021 ]

Tyler Palsulich commented on TIKA-1194:
---------------------------------------

[~tssk], were you ever able to create a safe version of the file? Do you still have it? It's been a while since this issue was opened.
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820013#comment-13820013 ]

Nick Burch commented on TIKA-1194:
----------------------------------

I've had a quick look, and WordExtractor from Apache POI skips the text too.

My first hunch would be that it's something to do with text fields.

Any chance you could step through the parser in a debugger, checking the text of the ranges around the point of the missing text, and see if there's anything odd going on?
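When checking range text in a debugger as suggested above, the interesting characters are easy to miss because they are invisible: in the binary .doc format, fields are delimited by the control characters 0x13 (field begin), 0x14 (field separator) and 0x15 (field end), cells end with 0x07, and paragraphs end with 0x0D. The sketch below is one way to make them visible. It is plain Java with no POI dependency; `ControlCharDump` and the tag names are hypothetical, chosen only for illustration.

```java
/** Hypothetical debugging helper; not part of Tika or POI. */
final class ControlCharDump {

    /**
     * Returns a copy of rangeText in which control characters are replaced
     * by readable tags, so field and cell boundaries show up in log output
     * or in a debugger watch expression.
     */
    static String visualize(String rangeText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < rangeText.length(); i++) {
            char c = rangeText.charAt(i);
            switch (c) {
                case 0x07: out.append("<CELL>"); break;        // cell / row mark
                case 0x0D: out.append("<PARA>"); break;        // paragraph mark
                case 0x13: out.append("<FIELD_BEGIN>"); break; // field begin
                case 0x14: out.append("<FIELD_SEP>"); break;   // field separator
                case 0x15: out.append("<FIELD_END>"); break;   // field end
                default:
                    if (c < 0x20) {
                        // Any other control character: show its hex code.
                        out.append(String.format("<0x%02X>", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: a field with an instruction and a result, then a cell mark.
        String sample = "Total: " + (char) 0x13 + " =SUM(ABOVE) " + (char) 0x14
                + "42" + (char) 0x15 + (char) 0x07;
        System.out.println(visualize(sample));
    }
}
```

Calling `visualize` on the text of the paragraphs just before and after the missing line should show whether a field or an unexpected cell mark sits at the boundary, which is only a starting point for the investigation, not a diagnosis.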
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820080#comment-13820080 ]

Tomas Safarik commented on TIKA-1194:
-------------------------------------

Sorry, I had to remove the document because it contained some personal information. I will upload a safe one as soon as possible.
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820139#comment-13820139 ]

Tomas Safarik commented on TIKA-1194:
-------------------------------------

The text is also missing from the Apache POI WordToTextConverter output, but I can see it intact in one variable in the debugger. Should I take this to the Apache POI bug tracker?