[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568566#comment-16568566
 ] 

Hudson commented on TIKA-2701:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #67 (See 
[https://builds.apache.org/job/tika-branch-1x/67/])
TIKA-2701 -- via Grigoriy Alekseev (tallison: 
[https://github.com/apache/tika/commit/d66dcbbbcf59f5b4034a47fed3346ad513b1fcc9])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf


> Text is not extracted properly from WMF files
> -
>
> Key: TIKA-2701
> URL: https://issues.apache.org/jira/browse/TIKA-2701
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Grigoriy Alekseev
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: thumbnail_1.wmf
>
>
> Text is always extracted on the assumption that it is encoded in cp-1252. The
> attached thumbnail_1.wmf contains text in Shift JIS, which is extracted
> incorrectly. The expected text is 普林斯.
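
The mismatch described in the report can be reproduced without a WMF file. The following is a minimal sketch, assuming a JVM that ships the Shift_JIS and windows-1252 charsets (most full JDK distributions do). The class and method names are made up for illustration and are not part of Tika or POI. It encodes the expected string as Shift JIS and then decodes the same bytes with each charset.

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {

    // Decode raw text bytes with the given charset (hypothetical helper).
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        Charset cp1252 = Charset.forName("windows-1252");

        // The expected text from the report (普林斯), written with Unicode
        // escapes so the source compiles regardless of file encoding.
        String expected = "\u666E\u6797\u65AF";
        byte[] raw = expected.getBytes(shiftJis);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println("Shift_JIS decode matches: "
                + decode(raw, shiftJis).equals(expected));

        // Decoding the same bytes as cp-1252 does not.
        System.out.println("cp-1252 decode matches:   "
                + decode(raw, cp1252).equals(expected));
    }
}
```

Only the Shift JIS decode returns the original string. Decoding as windows-1252 yields one character per byte, which is the kind of garbled output the report describes.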



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568550#comment-16568550
 ] 

Hudson commented on TIKA-2701:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1536 (See 
[https://builds.apache.org/job/Tika-trunk/1536/])
fix for TIKA-2701 contributed by grigoriy (grigoriyalexeev: 
[https://github.com/apache/tika/commit/ba436227b987199717b5b94689d10c7b0d1deb14])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
TIKA-2701 change test file name (tallison: 
[https://github.com/apache/tika/commit/7e477e3ae6eaf819f42be2d7eb43aaef89002e4c])
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
* (delete) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf




[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568374#comment-16568374
 ] 

Hudson commented on TIKA-2701:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #291 (See 
[https://builds.apache.org/job/tika-2.x-windows/291/])
fix for TIKA-2701 contributed by grigoriy (grigoriyalexeev: rev 
ba436227b987199717b5b94689d10c7b0d1deb14)
* (add) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
TIKA-2701 change test file name (tallison: rev 
7e477e3ae6eaf819f42be2d7eb43aaef89002e4c)
* (add) tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf
* (delete) tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java




[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568292#comment-16568292
 ] 

ASF GitHub Bot commented on TIKA-2701:
--

tballison closed pull request #245: fix for TIKA-2701 contributed by grigoriy
URL: https://github.com/apache/tika/pull/245
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
index e0a2507da..806c7d9ba 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
@@ -62,10 +62,10 @@ public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
 xhtml.startDocument();
 try {
 HwmfPicture picture = new HwmfPicture(stream);
+Charset charset = LocaleUtil.CHARSET_1252;
 //TODO: make x/y info public in POI so that we can use it here
 //to determine when to keep two text parts on the same line
 for (HwmfRecord record : picture.getRecords()) {
-Charset charset = LocaleUtil.CHARSET_1252;
 //this is pure hackery for specifying the font
 //TODO: do what Graphics does by maintaining the stack, etc.!
 //This fix should be done within POI
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
index 42fb22098..977f74f4e 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WMFParserTest.java
@@ -29,14 +29,20 @@
 
 @Test
 public void testTextExtractionWindows() throws Exception {
-List<Metadata> metadataList = getRecursiveMetadata("testXLSX_Thumbnail.xlsx");
-Metadata wmfMetadata = metadataList.get(1);
-assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
-assertContains("This file contains an embedded thumbnail",
-wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+testTextExtraction("testXLSX_Thumbnail.xlsx", 1, "This file contains an embedded thumbnail");
+}
+
+@Test
+public void testTextExtractionShiftJISencoding() throws Exception {
+testTextExtraction("thumbnail_1.wmf", 0, "普林斯");
 }
 
-//TODO fix wmf text extraction in "testRTFEmbeddedFiles.rtf"
-//Chinese is garbled.
+private void testTextExtraction(String fileName, int metaDataItemIndex, String expectedText) throws Exception {
+List<Metadata> metadataList = getRecursiveMetadata(fileName);
+Metadata wmfMetadata = metadataList.get(metaDataItemIndex);
+
+assertEquals("image/wmf", wmfMetadata.get(Metadata.CONTENT_TYPE));
+assertContains(expectedText, wmfMetadata.get(RecursiveParserWrapper.TIKA_CONTENT));
+}
 }
 
diff --git a/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf
new file mode 100644
index 0..b860d183d
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/thumbnail_1.wmf differ
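
The WMFParser.java hunk above moves the `Charset charset` declaration out of the record loop. The sketch below illustrates why that scope matters. It assumes that one record selects a charset and a later record carries the text bytes. The diff does not show how the real parser updates the charset, so the `Rec` type and both methods are hypothetical stand-ins, not Tika or POI APIs.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class CharsetScopeDemo {

    static final Charset DEFAULT_CHARSET = Charset.forName("windows-1252");

    // Hypothetical record: either selects a charset or carries text bytes.
    static final class Rec {
        final Charset newCharset; // non-null for a "font-like" record
        final byte[] text;        // non-null for a "text" record

        Rec(Charset newCharset, byte[] text) {
            this.newCharset = newCharset;
            this.text = text;
        }
    }

    // Before: the charset is re-declared per record, so a charset chosen by
    // one record is forgotten before the next record is processed.
    static String extractResettingPerRecord(List<Rec> records) {
        StringBuilder out = new StringBuilder();
        for (Rec r : records) {
            Charset charset = DEFAULT_CHARSET;
            if (r.newCharset != null) {
                charset = r.newCharset;
            }
            if (r.text != null) {
                out.append(new String(r.text, charset));
            }
        }
        return out.toString();
    }

    // After: the charset is declared once, so it persists across records.
    static String extractKeepingState(List<Rec> records) {
        StringBuilder out = new StringBuilder();
        Charset charset = DEFAULT_CHARSET;
        for (Rec r : records) {
            if (r.newCharset != null) {
                charset = r.newCharset;
            }
            if (r.text != null) {
                out.append(new String(r.text, charset));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        String expected = "\u666E\u6797\u65AF"; // 普林斯
        List<Rec> records = Arrays.asList(
                new Rec(shiftJis, null),
                new Rec(null, expected.getBytes(shiftJis)));

        System.out.println("keeping state matches:    "
                + extractKeepingState(records).equals(expected));
        System.out.println("resetting per record matches: "
                + extractResettingPerRecord(records).equals(expected));
    }
}
```

In this sketch only the variant that keeps the charset outside the loop recovers the expected string, because the text record no longer falls back to the cp-1252 default.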


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-01 Thread Grigoriy Alekseev (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566370#comment-16566370
 ] 

Grigoriy Alekseev commented on TIKA-2701:
-

[~talli...@apache.org], my pleasure :)



[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566365#comment-16566365
 ] 

ASF GitHub Bot commented on TIKA-2701:
--

grigoriy opened a new pull request #245: fix for TIKA-2701 contributed by grigoriy
URL: https://github.com/apache/tika/pull/245
 
 
   






[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565701#comment-16565701
 ] 

Tim Allison commented on TIKA-2701:
---

+1. I cannot describe the joy it brings me that someone cares about a) WMF and
b) encodings. :D  Thank you!



[jira] [Commented] (TIKA-2701) Text is not extracted properly from WMF files

2018-08-01 Thread Grigoriy Alekseev (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565069#comment-16565069
 ] 

Grigoriy Alekseev commented on TIKA-2701:
-

Will create a pull request.
