Re: Detection of plain text files

2019-06-25 Thread Ken Krugler
Hi Tim,

Seems like what we’d want is “isText()” rather than what we’ve got, which is 
“isAscii()”.

Any thoughts on switching to what I thought was the older algorithm: (a) not 
many unexpected control chars, and (b) a reasonable number of line ending chars?
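
For concreteness, here is a minimal sketch of the kind of check described in
(a) and (b). It is not existing Tika code: the class name TextHeuristic, the
method looksLikeText(), and both thresholds (2% unexpected control bytes,
1024 bytes average line length) are assumptions picked for illustration.

```java
// Hypothetical sketch, not Tika code: the class name, method name, and both
// thresholds are assumptions made for illustration.
final class TextHeuristic {
    // Assumed limit on control bytes other than tab/LF/CR, in percent.
    static final int MAX_UNEXPECTED_CONTROL_PERCENT = 2;
    // Assumed upper bound on the average line length, in bytes.
    static final int MAX_AVG_LINE_LENGTH = 1024;

    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        long unexpectedControl = 0;
        long lineEndings = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b == '\n') {
                lineEndings++;
            } else if (b == '\r') {
                // A CR followed by LF is counted once, when the LF is seen;
                // a lone CR counts as a line ending by itself.
                if (i + 1 >= data.length || data[i + 1] != '\n') {
                    lineEndings++;
                }
            } else if (b < 0x20 && b != '\t') {
                unexpectedControl++;
            }
        }
        // (a) not many unexpected control chars
        boolean fewControls =
                unexpectedControl * 100 <= (long) data.length * MAX_UNEXPECTED_CONTROL_PERCENT;
        // (b) a reasonable number of line endings, expressed as average line length
        boolean enoughLineEndings =
                data.length / (lineEndings + 1) <= MAX_AVG_LINE_LENGTH;
        return fewControls && enoughLineEndings;
    }
}
```

Double-byte Shift_JIS characters only use bytes at or above 0x40, so a file
like the one in question would pass (a) with a check of this kind. UTF-16
would not, since ASCII-range characters carry a 0x00 byte there.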

— Ken

> On Jun 25, 2019, at 6:56 AM, Tim Allison wrote:
> 
> Hi Ken,
>  I'm sorry for my delay.  I took a short chunk of Japanese and
> converted it to Shift_JIS.
> 
>  Your memory is largely correct (or we've changed the code base a
> bit).  The TextDetector makes a decision in favor of {{text/plain}} vs
> {{application/octet-stream}} via TextStatistics
> (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
> if the bytes are:
> 
> a) mostly in the ascii range (btwn 0x20 and 128) and don't have too
> many control characters
> b) kind of look like UTF-8
> 
> In the example file I used, there were 0 control, 36 ascii (btwn 0x20
> and 128) and 0 safe terms, but the total character count was 218.  The
> isAscii() requires > 90% of the characters to fall btwn 0x20 and
> 128 (here 36 of 218, about 16.5%)...so the text detector failed.
> 
> In short, this is an area for improvement.  I suspect our current
> mechanism would also be pretty awful on UTF-16.
> 
> On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler wrote:
>> 
>> Hi devs,
>> 
>> I’m trying to remember the history of how Tika’s current mime-type detection 
>> has evolved, regarding handling of plain text files.
>> 
>> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) 
>> it gets returned as application/octet-stream.
>> 
>> I thought that previously we had something which would check if the file 
>> only had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars 
>> besides these), and a reasonable number of line ending chars, and if so then 
>> we’d return text/plain instead of application/octet-stream
>> 
>> Thanks,
>> 
>> — Ken
>> 
>> --
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>> 




Re: [EXTERNAL] Re: Tika 1.22?

2019-06-25 Thread Chris Mattmann
Looks good…

From: Oleg Tikhonov
Reply-To: "dev@tika.apache.org"
Date: Tuesday, June 25, 2019 at 7:57 AM
To: "dev@tika.apache.org"
Subject: [EXTERNAL] Re: Tika 1.22?

Would be great!!!
Cheers,
Oleg

On Tue, Jun 25, 2019, 17:45 Tim Allison wrote:

All,
  The vote for the next version of PDFBox is under way.  I think we've
had a number of useful upgrades since our last release.  Any
objections to starting the release process for Tika 1.22 a week or so
after we integrate PDFBox?

 Cheers,

  Tim


Re: Tika 1.22?

2019-06-25 Thread Oleg Tikhonov
Would be great!!!
Cheers,
Oleg

On Tue, Jun 25, 2019, 17:45 Tim Allison wrote:

> All,
>   The vote for the next version of PDFBox is under way.  I think we've
> had a number of useful upgrades since our last release.  Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
>  Cheers,
>
>   Tim
>


Re: Tika 1.22?

2019-06-25 Thread Sergey Beryozkin
Sounds good

Thanks, Sergey

On Tue, Jun 25, 2019 at 3:45 PM Tim Allison wrote:

> All,
>   The vote for the next version of PDFBox is under way.  I think we've
> had a number of useful upgrades since our last release.  Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
>  Cheers,
>
>   Tim
>


[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-25 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872418#comment-16872418
 ] 

Tim Allison commented on TIKA-2790:
---

bq.  For an apples-to-apples comparison with OpenNLP, I guess you'd have to 
load the same 103 language models that they support (or some intersection of 
the same?)

Y.  Absolutely. The initial comparison was "out of the box"* ... apples to 
oranges. *With the one exception that I loaded all of Yalder's languages, 
including the extras.  I wanted to see, initially, what happens if we take the 
packages off the shelf.  I agree that it would be better to do a follow-on 
apples-apples. :)

bq. as yalder is slower than Optimaize & OpenNLP when early termination is 
disabled,
This has been puzzling me as well.  My _guess_ is that Yalder is updating the 
stats with every new known ngram, rather than batching counts.  But there may 
very well be something else going on, including the 2x number of languages that 
Yalder was handling!

bq.  and even slower on short text with early termination
I'd want to do quite a bit more benchmarking on short texts to confirm this 
generally.  I worry about micro-benchmarking pitfalls.  I am more comfortable 
with the results on longer chunks of text.



> Consider switching lang-detection in tika-eval to open-nlp
> --
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_10_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Tika 1.22?

2019-06-25 Thread Tim Allison
All,
  The vote for the next version of PDFBox is under way.  I think we've
had a number of useful upgrades since our last release.  Any
objections to starting the release process for Tika 1.22 a week or so
after we integrate PDFBox?

 Cheers,

  Tim


Re: Detection of plain text files

2019-06-25 Thread Tim Allison
Hi Ken,
  I'm sorry for my delay.  I took a short chunk of Japanese and
converted it to Shift_JIS.

  Your memory is largely correct (or we've changed the code base a
bit).  The TextDetector makes a decision in favor of {{text/plain}} vs
{{application/octet-stream}} via TextStatistics
(https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
if the bytes are:

a) mostly in the ascii range (btwn 0x20 and 128) and don't have too
many control characters
b) kind of look like UTF-8

In the example file I used, there were 0 control, 36 ascii (btwn 0x20
and 128) and 0 safe terms, but the total character count was 218.  The
isAscii() requires > 90% of the characters to fall btwn 0x20 and
128 (here 36 of 218, about 16.5%)...so the text detector failed.
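
To make the arithmetic concrete, here is a small sketch of the threshold test
as described above. It is written from the description in this message rather
than copied from TextStatistics: the class and method names are hypothetical,
and the 2% limit on non-safe control bytes is an assumption (the description
only says "not too many").

```java
// Hypothetical sketch of the ASCII-ratio test described above; this is not
// the actual TextStatistics source.
final class AsciiThresholdSketch {
    // Assumed limit on control bytes that are not "safe", in percent.
    static final int MAX_CONTROL_PERCENT = 2;
    // From the description: more than 90% of the bytes must be in range.
    static final int MIN_ASCII_PERCENT = 90;

    // control: bytes below 0x20; ascii: bytes from 0x20 up to (not including) 128;
    // safe: the subset of control bytes treated as harmless; total: all bytes.
    static boolean passesAsciiThreshold(int control, int ascii, int safe, int total) {
        return total > 0
                && (control - safe) * 100 < total * MAX_CONTROL_PERCENT
                && (ascii + safe) * 100 > total * MIN_ASCII_PERCENT;
    }

    public static void main(String[] args) {
        // Counts reported for the Shift_JIS sample: 36 * 100 = 3600 is not
        // greater than 218 * 90 = 19620, so the 90% test fails.
        System.out.println(passesAsciiThreshold(0, 36, 0, 218));
    }
}
```

With these counts the control-character condition is satisfied; it is only
the 90% ASCII requirement that rejects the file.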

In short, this is an area for improvement.  I suspect our current
mechanism would also be pretty awful on UTF-16.

On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler wrote:
>
> Hi devs,
>
> I’m trying to remember the history of how Tika’s current mime-type detection 
> has evolved, regarding handling of plain text files.
>
> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) 
> it gets returned as application/octet-stream.
>
> I thought that previously we had something which would check if the file only 
> had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars besides 
> these), and a reasonable number of line ending chars, and if so then we’d 
> return text/plain instead of application/octet-stream
>
> Thanks,
>
> — Ken
>