[
https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944977#comment-15944977
]
Steven Hall edited comment on TIKA-2313 at 3/28/17 11:26 AM:
-------------------------------------------------------------
Attached breaking file.
Unfortunately I don't have any light to shed on detecting junk in an
intelligent way. Right now I'm removing all non-printable characters from the
output and then verifying that 1- or 2-letter words do not make up the
majority of the document. Broken documents tend to have a large number of
1-letter words once non-printables have been removed.
Seems to work fine but it's definitely not perfect -- I also have the benefit
of having all documents in a Latin language so removing non-printables gives me
a lot of benefit. I'll take part in the linked issue if I find anything useful.
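For illustration, a minimal sketch of the check described above. The function
names, the 0.5 threshold and the definition of "non-printable" (anything
outside the Latin blocks below U+0250, plus control characters) are
assumptions made for this example, not the exact code in use, and the sample
strings are made up.

```python
def strip_non_latin(text: str) -> str:
    """Drop control characters and anything outside the Latin blocks.

    Assumption: "non-printable" is taken to mean "not a printable character
    below U+0250", which only makes sense for Latin-script documents.
    """
    return "".join(
        ch for ch in text
        if ch.isspace() or (ch.isprintable() and ord(ch) < 0x0250)
    )


def short_word_ratio(text: str, max_len: int = 2) -> float:
    """Fraction of whitespace-separated words of at most max_len characters."""
    words = strip_non_latin(text).split()
    if not words:
        return 0.0
    short = sum(1 for word in words if len(word) <= max_len)
    return short / len(words)


def looks_like_junk(text: str, threshold: float = 0.5) -> bool:
    """Flag text in which short words make up the majority of all words."""
    return short_word_ratio(text) > threshold


if __name__ == "__main__":
    sample = "Suite à notre entretien téléphonique, je vous confirme notre rendez-vous"
    garbled = "\u414b\u4154 A \u0707 \u00e0 \u3739 ~ text"
    print(looks_like_junk(sample), looks_like_junk(garbled))
```

An empty or fully stripped document yields a ratio of 0.0 here, so it would
need a separate check if empty output should also count as broken.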
As for detecting language, I had originally thought of that, but then I
realised that getting Chinese characters is essentially arbitrary: some
(broken) encoding from the old documents is being mapped into a modern
encoding, and it just so happens that the mapping points to Chinese characters.
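To illustrate how such a mapping can land on Chinese characters, here is a
small sketch that reads single-byte text as if it were UTF-16LE. The choice of
Windows-1252 (cp1252) as the original encoding and the helper names are
assumptions; I have not confirmed that this is what actually happens inside
Tika for this file.

```python
def misread_as_utf16le(text: str, source_encoding: str = "cp1252") -> str:
    """Encode text one byte per character, then decode the bytes in pairs."""
    raw = text.encode(source_encoding)
    if len(raw) % 2:
        raw += b"\x00"  # pad so the byte string splits into 16-bit units
    return raw.decode("utf-16-le", errors="replace")


def recover(text: str, source_encoding: str = "cp1252") -> str:
    """Reverse the misreading: back to bytes, then decode one byte at a time."""
    return text.encode("utf-16-le").decode(source_encoding, errors="replace")


if __name__ == "__main__":
    title = "KATALYSE"  # the dc:title value from the metadata
    garbled = misread_as_utf16le(title)
    # Each pair of Latin letters collapses into one 16-bit code unit:
    # U+414B U+4154 U+594C U+4553, all of which are CJK ideographs.
    print([f"U+{ord(ch):04X}" for ch in garbled])
    print(recover(garbled))
```

ASCII letters have byte values between 0x41 and 0x7A, so any pair of them read
as one little-endian 16-bit unit falls between U+4141 and U+7A7A, a range that
is almost entirely CJK ideographs. That would explain why the result looks
Chinese rather than like some other script.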
was (Author: emmerich):
Attached breaking file.
Unfortunately I don't have any light to shed on detecting junk in an
intelligent way. Right now I'm removing all non-printable characters from the
output and then verifying the percentage of 1- or 2-letter words is not the
majority of the document. Seems to work fine but it's definitely not perfect --
I also have the benefit of having all documents in a Latin language so removing
non-printables gives me a lot of benefit. I'll take part in the linked issue if
I find anything useful.
As for detecting language, I had originally thought of that, but then I
realised that getting Chinese characters is essentially arbitrary: some
(broken) encoding from the old documents is being mapped into a modern
encoding, and it just so happens that the mapping points to Chinese characters.
> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
> Key: TIKA-2313
> URL: https://issues.apache.org/jira/browse/TIKA-2313
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Steven Hall
> Priority: Minor
> Attachments: old.DOC
>
>
> I've a really old Word document (last date of modification is December 1997)
> which was written with Microsoft Word 6.0.
> When I attempt to use Tika to extract the contents of the document, I get
> incorrect output. The output seems to be in Chinese, but I actually
> believe that the encoding of the document is not correctly mapped to the
> output encoding, which causes characters to be thrown off. I'm a complete
> beginner in document encodings so I could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older
> documents. I've also read that Tika should support Word 6.0, so I'm not sure.
> My guess for the moment is that the encoding within the document has
> incorrect character mappings. It's possible that it uses an incompatible
> mapping that, when Tika converts into its UTF-16 output, maps to Chinese
> characters instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata,
> including the document title, which is presumably in the same encoding as the
> document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the
> encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the
> number of documents like this is very small, but I don't have a very reliable
> way of detecting that the output is garbage.
> Like I said, I'm quite a beginner with Tika, so if there are any further
> commands you would like me to run, please say.
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By"
> content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template"
> content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇潍獮敩牵ബ畓瑩⃠潮牴散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧敬瀠慬獩物搠潣普物敭潮牴敲摮穥瘭畯畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍潶獵瀠敳瑮牥楡氠牯敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ慄獮挠瑥整愠瑴湥整敪瘠畯牰敩搠牣楯敲潍獮敩牵⃠❬獡畳慲据敤洠獥猠湥楴敭瑮敬敭汩敬牵䜉奕䰠䍅䱏ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦ĀʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p></p>
> <p>ㄍ</p>
> <p>ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p></p>
> <p></p>
> <p></p>
> <p>ᨍ</p>
> </body></html>
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)