Re: Detection of plain text files
Hi Tim,

Seems like what we’d want is “isText()” vs what we’ve got, which is “isAscii()”.

Any thoughts on switching to what I thought was the older algorithm, of (a) not many unexpected control chars, and (b) a reasonable number of line ending chars?

— Ken

> On Jun 25, 2019, at 6:56 AM, Tim Allison wrote:
>
> Hi Ken,
> I'm sorry for my delay. I took a short chunk of Japanese and
> converted it to Shift_JIS.
>
> Your memory is largely correct (or we've changed the code base a
> bit). The TextDetector makes a decision in favor of {{text/plain}} vs
> {{application/octet}} via TextStatistics
> (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
> if the bytes are:
>
> a) mostly in the ascii range (btwn 0x20 and 128) and don't have too
> many control characters
> b) kind of look like UTF-8
>
> In the example file I used, there were 0 control, 36 ascii (btwn 0x20
> and 128) and 0 safe terms, but the total character count was 218. The
> isAscii() requires > 90% of the characters appear btwn 0x20 and
> 128...so the text detector failed.
>
> In short, this is an area for improvement. I suspect our current
> mechanism would also be pretty awful on UTF-16.
>
> On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler wrote:
>>
>> Hi devs,
>>
>> I’m trying to remember the history of how Tika’s current mime-type detection
>> has evolved, regarding handling of plain text files.
>>
>> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”)
>> it gets returned as application/octet-stream.
>>
>> I thought that previously we had something which would check if the file
>> only had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars
>> besides these), and a reasonable number of line ending chars, and if so then
>> we’d return text/plain instead of application/octet-stream
>>
>> Thanks,
>>
>> — Ken
>>
>> --
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>>

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
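
The check Ken describes (no control bytes other than tab/LF/CR, plus a reasonable number of line endings) could be sketched roughly as below. This is only an illustration, not code from Tika: the class name PlainTextHeuristicSketch, the method looksLikeText, and the line-ending threshold are all hypothetical, and the thread does not say what "reasonable" should mean.

```java
final class PlainTextHeuristicSketch {

    // Hypothetical threshold (not from the thread): require at least one
    // line-ending byte per this many bytes of input.
    static final int MAX_BYTES_PER_LINE_ENDING = 1000;

    /**
     * Returns true if the sample has no control bytes in 0x00-0x1F other
     * than tab/LF/CR and has "enough" line-ending bytes. Bytes at 0x80 and
     * above are accepted, so multi-byte encodings are not rejected for
     * being non-ASCII.
     */
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false; // nothing to judge
        }
        int lineEndings = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == '\n' || v == '\r') {
                lineEndings++; // note: a CRLF pair counts twice here
            } else if (v < 0x20 && v != '\t') {
                return false; // unexpected control byte
            }
        }
        return (long) lineEndings * MAX_BYTES_PER_LINE_ENDING >= data.length;
    }
}
```

A sample made of high-bit byte pairs followed by a newline passes this check, while any sample containing a 0x00 byte fails it. UTF-16 text with ASCII-range characters contains 0x00 bytes, so a check like this would reject it too.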
Re: [EXTERNAL] Re: Tika 1.22?
Looks good…

From: Oleg Tikhonov
Reply-To: "dev@tika.apache.org"
Date: Tuesday, June 25, 2019 at 7:57 AM
To: "dev@tika.apache.org"
Subject: [EXTERNAL] Re: Tika 1.22?

Would be great!!!
Cheers,
Oleg

On Tue, Jun 25, 2019, 17:45 Tim Allison wrote:

All,
The vote for the next version of PDFBox is under way. I think we've had a number of useful upgrades since our last release. Any objections to starting the release process for Tika 1.22 a week or so after we integrate PDFBox?

Cheers,

Tim
Re: Tika 1.22?
Would be great!!!
Cheers,
Oleg

On Tue, Jun 25, 2019, 17:45 Tim Allison wrote:

> All,
> The vote for the next version of PDFBox is under way. I think we've
> had a number of useful upgrades since our last release. Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
> Cheers,
>
> Tim
>
Re: Tika 1.22?
Sounds good

Thanks, Sergey

On Tue, Jun 25, 2019 at 3:45 PM Tim Allison wrote:

> All,
> The vote for the next version of PDFBox is under way. I think we've
> had a number of useful upgrades since our last release. Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
> Cheers,
>
> Tim
>
[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872418#comment-16872418 ]

Tim Allison commented on TIKA-2790:
---

bq. For an apples-to-apples comparison with OpenNLP, I guess you'd have to load the same 103 language models that they support (or some intersection of the same?)

Y. Absolutely. The initial comparison was "out of the box"* ... apples to oranges.

*With the one exception that I loaded all of Yalder's languages, including the extras.

I wanted to see, initially, what happens if we take the packages off the shelf. I agree that it would be better to do a follow-on apples-apples. :)

bq. as yalder is slower than Optimaize & OpenNLP when early termination is disabled,

This has been puzzling me as well. My _guess_ is that Yalder is updating the stats with every new known ngram, rather than batching counts. But there may very well be something else going on, including the 2x number of languages that Yalder was handling!

bq. and even slower on short text with early termination

I'd want to do quite a bit more benchmarking on short texts to confirm this generally. I worry about micro-benchmarking pitfalls. I am more comfortable with the results on longer chunks of text.

> Consider switching lang-detection in tika-eval to open-nlp
> --
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_10_0.0_0.txt, hasEnough.png,
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip,
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Tika 1.22?
All,

The vote for the next version of PDFBox is under way. I think we've had a number of useful upgrades since our last release. Any objections to starting the release process for Tika 1.22 a week or so after we integrate PDFBox?

Cheers,

Tim
Re: Detection of plain text files
Hi Ken,

I'm sorry for my delay. I took a short chunk of Japanese and converted it to Shift_JIS.

Your memory is largely correct (or we've changed the code base a bit). The TextDetector makes a decision in favor of {{text/plain}} vs {{application/octet}} via TextStatistics (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46) if the bytes are:

a) mostly in the ascii range (btwn 0x20 and 128) and don't have too many control characters
b) kind of look like UTF-8

In the example file I used, there were 0 control, 36 ascii (btwn 0x20 and 128) and 0 safe terms, but the total character count was 218. The isAscii() requires > 90% of the characters appear btwn 0x20 and 128...so the text detector failed.

In short, this is an area for improvement. I suspect our current mechanism would also be pretty awful on UTF-16.

On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler wrote:
>
> Hi devs,
>
> I’m trying to remember the history of how Tika’s current mime-type detection
> has evolved, regarding handling of plain text files.
>
> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”)
> it gets returned as application/octet-stream.
>
> I thought that previously we had something which would check if the file only
> had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars besides
> these), and a reasonable number of line ending chars, and if so then we’d
> return text/plain instead of application/octet-stream
>
> Thanks,
>
> — Ken
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
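
The arithmetic behind the failed check can be worked through with a small sketch. It follows the thresholds as described in the message above and is not Tika's actual TextStatistics code: the class name AsciiRatioSketch, the method looksMostlyAscii, and the maxControlPercent parameter are hypothetical, and the message gives no figure for "too many" control characters.

```java
final class AsciiRatioSketch {

    /**
     * Applies the two conditions from the message: more than 90% of the
     * bytes fall in the 0x20-128 range, and control characters stay under
     * a caller-supplied percentage (an assumed parameter, since the thread
     * does not state the limit).
     */
    static boolean looksMostlyAscii(int control, int ascii, int total, int maxControlPercent) {
        return total > 0
                && control * 100 < total * maxControlPercent
                && ascii * 100 > total * 90;
    }

    public static void main(String[] args) {
        // Counts reported for the Shift_JIS sample: 0 control, 36 ascii, 218 total.
        // 36 * 100 = 3600, which is not greater than 218 * 90 = 19620,
        // so the 90% condition fails (36 of 218 is roughly 16.5%).
        boolean result = looksMostlyAscii(0, 36, 218, 2);
        System.out.println("mostly ascii: " + result);
    }
}
```

With these counts the sample falls far short of the 90% requirement, which matches the reported outcome: the detector does not return text/plain.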