[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

ASF GitHub Bot (JIRA) Fri, 03 Aug 2018 07:48:31 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568292#comment-16568292
 ]


ASF GitHub Bot commented on TIKA-2701:
--------------------------------------

tballison closed pull request #245: fix for TIKA-2701 contributed by grigoriy
URL: https://github.com/apache/tika/pull/245
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
index e0a2507da..806c7d9ba 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
@@ -62,10 +62,10 @@ public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
         xhtml.startDocument();
         try {
             HwmfPicture picture = new HwmfPicture(stream);
+            Charset charset = LocaleUtil.CHARSET_1252;
             //TODO: make x/y info public in POI so that we can use it here
             //to determine when to keep two text parts on the same line
             for (HwmfRecord record : picture.getRecords()) {
-                Charset charset = LocaleUtil.CHARSET_1252;
                 //this is pure hackery for specifying the font
                 //TODO: do what Graphics does by maintaining the stack, etc.!
                 //This fix should be done within POI
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
index 42fb22098..977f74f4e 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
@@ -29,14 +29,20 @@
 
     @Test
     public void testTextExtractionWindows() throws Exception {
-        List<Metadata> metadataList = getRecursiveMetadata("testXLSX_Thumbnail.xlsx");
-        Metadata wmfMetadata = metadataList.get(1);
-        assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
-        assertContains("This file contains an embedded thumbnail",
-                wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+        testTextExtraction("testXLSX_Thumbnail.xlsx", 1, "This file contains an embedded thumbnail");
+    }
+
+    @Test
+    public void testTextExtractionShiftJISencoding() throws Exception {
+        testTextExtraction("thumbnail_1.wmf", 0, "普林斯");
     }
 
-    //TODO fix wmf text extraction in "testRTFEmbeddedFiles.rtf"
-    //Chinese is garbled.
+    private void testTextExtraction(String fileName, int metaDataItemIndex, String expectedText) throws Exception {
+        List<Metadata> metadataList = getRecursiveMetadata(fileName);
+        Metadata wmfMetadata = metadataList.get(metaDataItemIndex);
+
+        assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
+        assertContains(expectedText, wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+    }
 }
 
diff --git a/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
new file mode 100644
index 000000000..b860d183d
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf differ
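
Read on its own, the WMFParser.java hunk appears to move the charset variable out of the per-record loop, so that a charset picked up while handling one record is still in effect when a later text record is decoded. The following is a minimal sketch of that scoping effect only. The `Rec` type, its `font`/`text` factories, and the `extract` helper are hypothetical stand-ins, not POI's `HwmfRecord` API, and the sketch assumes the running JRE provides the `Shift_JIS` charset.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class CharsetScopeDemo {

    // Hypothetical stand-in for a metafile record: it either selects a
    // charset (like a font record) or carries raw text bytes.
    static final class Rec {
        final Charset newCharset; // non-null for a "font" record
        final byte[] text;        // non-null for a "text" record

        private Rec(Charset newCharset, byte[] text) {
            this.newCharset = newCharset;
            this.text = text;
        }

        static Rec font(Charset cs) { return new Rec(cs, null); }
        static Rec text(byte[] raw) { return new Rec(null, raw); }
    }

    // resetPerRecord = true mimics declaring the charset inside the loop
    // (the removed line); false mimics declaring it once before the loop.
    static String extract(List<Rec> records, boolean resetPerRecord) {
        Charset fallback = Charset.forName("windows-1252");
        Charset charset = fallback;
        StringBuilder out = new StringBuilder();
        for (Rec r : records) {
            if (resetPerRecord) {
                charset = fallback; // forgets any charset seen earlier
            }
            if (r.newCharset != null) {
                charset = r.newCharset;
            }
            if (r.text != null) {
                out.append(new String(r.text, charset));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        // Unicode escapes for the three characters quoted in the issue.
        String expected = "\u666E\u6797\u65AF";
        List<Rec> records = Arrays.asList(
                Rec.font(shiftJis),
                Rec.text(expected.getBytes(shiftJis)));

        System.out.println(extract(records, false).equals(expected));
        System.out.println(extract(records, true).equals(expected));
    }
}
```

With the reset enabled, the charset chosen by the font record is discarded before the text record is reached, which is the behaviour the one-line move in the hunk seems intended to avoid.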


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Text is not extracted properly from WMF files
> ---------------------------------------------
>
>                 Key: TIKA-2701
>                 URL: https://issues.apache.org/jira/browse/TIKA-2701
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Grigoriy Alekseev
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: thumbnail_1.wmf
>
>
> Text is always extracted as if it were encoded in cp-1252. The attached 
> thumbnail_1.wmf contains text in Shift JIS, which is therefore extracted 
> incorrectly. The expected text is 普林斯.
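
The symptom described above can be reproduced outside Tika. The sketch below is an illustration only, not code from the patch: it encodes the expected text as Shift JIS bytes and decodes those bytes once with the matching charset and once with windows-1252. The class name `WmfCharsetDemo` and the `decode` helper are hypothetical, and the sketch assumes the running JRE provides the `Shift_JIS` charset.

```java
import java.nio.charset.Charset;

public class WmfCharsetDemo {

    // Decode raw bytes with the given charset, as a text extractor would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        Charset cp1252 = Charset.forName("windows-1252");

        // Unicode escapes for the three characters quoted in the issue.
        String expected = "\u666E\u6797\u65AF";
        byte[] raw = expected.getBytes(shiftJis);

        // Matching charset: the original text is recovered.
        System.out.println(decode(raw, shiftJis));
        // Mismatched charset: each byte is mapped to an unrelated character.
        System.out.println(decode(raw, cp1252));
    }
}
```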



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
