[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568566#comment-16568566 ]

Hudson commented on TIKA-2701:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #67 (See [https://builds.apache.org/job/tika-branch-1x/67/])
TIKA-2701 -- via Grigoriy Alekseev (tallison: [https://github.com/apache/tika/commit/d66dcbbbcf59f5b4034a47fed3346ad513b1fcc9])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf

> Text is not extracted properly from WMF files
> ---------------------------------------------
>
>                 Key: TIKA-2701
>                 URL: https://issues.apache.org/jira/browse/TIKA-2701
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Grigoriy Alekseev
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: thumbnail_1.wmf
>
>
> Text is always extracted assuming it is in cp-1252 encoding. The attached
> thumbnail_1.wmf has text in Shift JIS and is extracted incorrectly. Should be
> 普林斯.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
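The encoding mismatch described in the issue can be reproduced without any WMF file. The following is a minimal sketch, not code from Tika or POI: the class and method names are hypothetical, and it assumes the JRE ships the Shift_JIS charset (standard JDK builds do). It encodes the expected string as Shift JIS and then decodes the same bytes once as windows-1252 (what the parser assumed) and once as Shift JIS.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of Tika or POI.
final class WmfCharsetDemo {
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The expected text from the issue, written with Unicode escapes
        // so the example does not depend on the source file encoding.
        String expected = "\u666E\u6797\u65AF";
        byte[] raw = expected.getBytes(SHIFT_JIS);

        // Decoding with the wrong charset yields unrelated Latin characters.
        System.out.println("as windows-1252: " + decode(raw, CP1252));
        // Decoding with the charset the bytes were written in restores the text.
        System.out.println("as Shift_JIS:    " + decode(raw, SHIFT_JIS));
    }
}
```

The bytes are identical in both calls; only the charset passed to the decoder changes, which is why the parser has to track the charset declared inside the file.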
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568550#comment-16568550 ]

Hudson commented on TIKA-2701:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1536 (See [https://builds.apache.org/job/Tika-trunk/1536/])
fix for TIKA-2701 contributed by grigoriy (grigoriyalexeev: [https://github.com/apache/tika/commit/ba436227b987199717b5b94689d10c7b0d1deb14])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
TIKA-2701 change test file name (tallison: [https://github.com/apache/tika/commit/7e477e3ae6eaf819f42be2d7eb43aaef89002e4c])
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (delete) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568374#comment-16568374 ]

Hudson commented on TIKA-2701:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #291 (See [https://builds.apache.org/job/tika-2.x-windows/291/])
fix for TIKA-2701 contributed by grigoriy (grigoriyalexeev: rev ba436227b987199717b5b94689d10c7b0d1deb14)
* (add) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
TIKA-2701 change test file name (tallison: rev 7e477e3ae6eaf819f42be2d7eb43aaef89002e4c)
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf
* (delete) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568292#comment-16568292 ]

ASF GitHub Bot commented on TIKA-2701:
--------------------------------------

tballison closed pull request #245: fix for TIKA-2701 contributed by grigoriy
URL: https://github.com/apache/tika/pull/245

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
index e0a2507da..806c7d9ba 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
@@ -62,10 +62,10 @@ public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
         xhtml.startDocument();
         try {
             HwmfPicture picture = new HwmfPicture(stream);
+            Charset charset = LocaleUtil.CHARSET_1252;
             //TODO: make x/y info public in POI so that we can use it here
             //to determine when to keep two text parts on the same line
             for (HwmfRecord record : picture.getRecords()) {
-                Charset charset = LocaleUtil.CHARSET_1252;
                 //this is pure hackery for specifying the font
                 //TODO: do what Graphics does by maintaining the stack, etc.!
                 //This fix should be done within POI
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
index 42fb22098..977f74f4e 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
@@ -29,14 +29,20 @@
 
     @Test
     public void testTextExtractionWindows() throws Exception {
-        List<Metadata> metadataList = getRecursiveMetadata("testXLSX_Thumbnail.xlsx");
-        Metadata wmfMetadata = metadataList.get(1);
-        assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
-        assertContains("This file contains an embedded thumbnail",
-                wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+        testTextExtraction("testXLSX_Thumbnail.xlsx", 1, "This file contains an embedded thumbnail");
+    }
+
+    @Test
+    public void testTextExtractionShiftJISencoding() throws Exception {
+        testTextExtraction("thumbnail_1.wmf", 0, "普林斯");
     }
 
-    //TODO fix wmf text extraction in "testRTFEmbeddedFiles.rtf"
-    //Chinese is garbled.
+    private void testTextExtraction(String fileName, int metaDataItemIndex, String expectedText) throws Exception {
+        List<Metadata> metadataList = getRecursiveMetadata(fileName);
+        Metadata wmfMetadata = metadataList.get(metaDataItemIndex);
+
+        assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
+        assertContains(expectedText, wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+    }
 
 }
diff --git a/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
new file mode 100644
index 0..b860d183d
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf differ

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
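The one-line move in the WMFParser.java diff above matters because of variable scope: a charset declared inside the loop is reset to cp-1252 on every record, so a charset picked up from one record is forgotten before the next record's text is decoded. The sketch below illustrates this under stated assumptions. `DemoRecord` and both `extract...` methods are hypothetical stand-ins, not POI or Tika API, and the sketch assumes that a font record precedes the text record it applies to.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the record model is a stand-in, not the POI API.
final class CharsetScopeDemo {
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");
    static final Charset CP1252 = Charset.forName("windows-1252");

    // A record is either a font record (carries a charset) or a text record (carries bytes).
    static final class DemoRecord {
        final Charset fontCharset; // non-null only for font records
        final byte[] text;         // non-null only for text records

        private DemoRecord(Charset fontCharset, byte[] text) {
            this.fontCharset = fontCharset;
            this.text = text;
        }

        static DemoRecord font(Charset charset) { return new DemoRecord(charset, null); }
        static DemoRecord text(byte[] bytes) { return new DemoRecord(null, bytes); }
    }

    // Behaviour before the fix: the charset is reset for every record,
    // so text records are always decoded as cp-1252.
    static String extractResettingPerRecord(List<DemoRecord> records) {
        StringBuilder out = new StringBuilder();
        for (DemoRecord record : records) {
            Charset charset = CP1252;
            if (record.fontCharset != null) charset = record.fontCharset;
            if (record.text != null) out.append(new String(record.text, charset));
        }
        return out.toString();
    }

    // Behaviour after the fix: the charset survives across loop iterations,
    // so a font record influences the text records that follow it.
    static String extractKeepingCharset(List<DemoRecord> records) {
        StringBuilder out = new StringBuilder();
        Charset charset = CP1252;
        for (DemoRecord record : records) {
            if (record.fontCharset != null) charset = record.fontCharset;
            if (record.text != null) out.append(new String(record.text, charset));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expected = "\u666E\u6797\u65AF"; // the text expected in the issue
        List<DemoRecord> records = Arrays.asList(
                DemoRecord.font(SHIFT_JIS),
                DemoRecord.text(expected.getBytes(SHIFT_JIS)));
        System.out.println("reset per record: " + extractResettingPerRecord(records));
        System.out.println("kept across loop: " + extractKeepingCharset(records));
    }
}
```

Keeping the charset as loop-external state is the smallest possible change; the TODO comments in the diff suggest the longer-term approach is to track font state inside POI itself.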
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566370#comment-16566370 ]

Grigoriy Alekseev commented on TIKA-2701:
-----------------------------------------

[~talli...@apache.org], my pleasure :)
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566365#comment-16566365 ]

ASF GitHub Bot commented on TIKA-2701:
--------------------------------------

grigoriy opened a new pull request #245: fix for TIKA-2701 contributed by grigoriy
URL: https://github.com/apache/tika/pull/245

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565701#comment-16565701 ]

Tim Allison commented on TIKA-2701:
-----------------------------------

+1 cannot describe the joy this brings me that someone cares about a) WMF b) encodings. :D

Thank you!
[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565069#comment-16565069 ]

Grigoriy Alekseev commented on TIKA-2701:
-----------------------------------------

Will create a pull request.