[ 
https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944977#comment-15944977
 ] 

Steven Hall edited comment on TIKA-2313 at 3/28/17 11:25 AM:
-------------------------------------------------------------

Attached breaking file.

Unfortunately I don't have any light to shed on detecting junk in an 
intelligent way. Right now I'm removing all non-printable characters from the 
output and then checking that 1- or 2-letter words do not make up the 
majority of the document. It seems to work fine, but it's definitely not perfect. 
I also have the benefit of all my documents being in a Latin-alphabet language, 
so removing non-printables helps a lot. I'll take part in the linked issue if 
I find anything useful.
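
A minimal sketch of the heuristic described above, in Python. The function 
names, the 0.5 threshold, the choice to treat anything outside the Basic Latin 
to Latin Extended-B range as "non-printable", and the rule that empty output 
counts as junk are all assumptions made for illustration, not the exact checks 
used in the application.

```python
def strip_non_printable(text: str) -> str:
    """Keep whitespace and printable Latin-range characters only.

    Assumption: "non-printable" here also covers every code point at or
    above U+0250, which suits documents written in the Latin alphabet.
    """
    return "".join(
        ch for ch in text
        if ch.isspace() or (ch.isprintable() and ord(ch) < 0x250)
    )


def looks_like_junk(text: str, max_short_ratio: float = 0.5) -> bool:
    """Return True when 1- or 2-letter words are the majority of the words.

    Assumption: output with no words left after stripping counts as junk.
    """
    words = strip_non_printable(text).split()
    if not words:
        return True
    short_words = sum(1 for word in words if len(word) <= 2)
    return short_words / len(words) > max_short_ratio
```

As noted, this is not perfect: a garbled run that contains little whitespace 
can collapse into a few long "words" after stripping and pass the ratio check, 
so it is probably worth combining with a check on how much of the text was 
stripped.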

As for detecting language, I had originally thought of that, but then realised 
that the appearance of Chinese characters is incidental: some (broken) encoding 
from the old documents is being mapped into a modern encoding, and it just so 
happens that the mapping points to Chinese characters.
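
One way to see how such a mapping can land on Chinese characters: if 8-bit 
text is read two bytes at a time as UTF-16LE, each pair of Latin letters 
becomes a single code point that often falls in the CJK ranges. The Python 
sketch below illustrates this. The function names are hypothetical and the 
choice of cp1252 as the original 8-bit encoding is an assumption. It is an 
illustration of the effect, not a confirmed diagnosis of what Tika does 
internally, although the first four characters of the garbled body in this 
issue do decode back to the title "KATALYSE" under this assumption.

```python
def misread_as_utf16le(text: str, source_encoding: str = "cp1252") -> str:
    """Encode text as 8-bit bytes, then wrongly decode the bytes as UTF-16LE."""
    data = text.encode(source_encoding)
    return data.decode("utf-16-le", errors="replace")


def recover_8bit_text(garbled: str, source_encoding: str = "cp1252") -> str:
    """Undo the misreading: back to the raw bytes, then decode as 8-bit text.

    Bytes that are undefined in the assumed encoding become U+FFFD.
    """
    data = garbled.encode("utf-16-le", errors="surrogatepass")
    return data.decode(source_encoding, errors="replace")


# "KA" is bytes 0x4B 0x41, which UTF-16LE reads as the single unit U+414B.
garbled_title = misread_as_utf16le("KATALYSE")
restored_title = recover_8bit_text(garbled_title)
```

This only helps for the stretches of the output that really are 8-bit text; 
binary data mixed into the body would still come out as junk.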


was (Author: emmerich):
Attached breaking file.

> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
>                 Key: TIKA-2313
>                 URL: https://issues.apache.org/jira/browse/TIKA-2313
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Steven Hall
>            Priority: Minor
>         Attachments: old.DOC
>
>
> I've a really old Word document (last date of modification is December 1997) 
> which was written with Microsoft Word 6.0.
> When I attempt to use Tika to extract the contents of the document, I receive 
> an incorrect output. The output seems to be in Chinese, but I actually 
> believe that the encoding of the document is not correctly mapped with the 
> output encoding which causes characters to be thrown off. I'm a complete 
> beginner in document encodings so could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older 
> documents. I've also read that Tika should support Word 6.0, so I'm not sure.
> My guess for the moment is that the encoding within the document has 
> incorrect character mappings. It's possibly using an incompatible mapping 
> that, when Tika converts into its UTF-16 output, maps to Chinese characters 
> instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata, 
> including the document title, which is presumably in the same encoding as the 
> document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the 
> encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the 
> number of documents like this is very small, but I don't have a very reliable 
> way of detecting that the output is garbage.
> Like I said, I'm quite a beginner with Tika, so if there are any further 
> commands you would like me to run, please say.
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" 
> content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template" 
> content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓഍഍䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨⁳湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇഍഍഍潍獮敩牵ബ഍畓瑩⁥⃠潮牴⁥散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧⁩敬瀠慬獩物搠⁥潣普物敭⁲潮牴⁥敲摮穥瘭畯⁳畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍⁥潶獵瀠敳瑮牥楡氠牯⁳敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ഍慄獮挠瑥整愠瑴湥整‬敪瘠畯⁳牰敩搠⁥牣楯敲‬潍獮敩牵‬⃠❬獡畳慲据⁥敤洠獥猠湥楴敭瑮⁳敬⁳敭汩敬牵⹳഍഍഍഍䜉奕䰠䍅䱏൅഍ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡෿ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤߼⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦Ā܏ʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p>഍</p>
> <p>ㄍ</p>
> <p>ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p>഍</p>
> <p>഍</p>
> <p>഍</p>
> <p>ᨍ</p>
> </body></html>
> {noformat}



