It seems reasonable to me. I think this is likely to open a can of worms, but we should do better. It'd be interesting to look at the tika-eval stats for the application/octet-stream files in our regression corpus and see whether they are actually language-y. How often is this happening?

On Tue, Jun 25, 2019 at 12:18 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
> Hi Tim,
>
> Seems like what we’d want is “isText()” vs what we’ve got, which is
> “isAscii()”.
>
> Any thoughts on switching to what I thought was the older algorithm, of (a)
> not many unexpected control chars, and (b) a reasonable number of line ending
> chars?
>
> -- Ken
>
> > On Jun 25, 2019, at 6:56 AM, Tim Allison <talli...@apache.org> wrote:
> >
> > Hi Ken,
> >   I'm sorry for my delay. I took a short chunk of Japanese and
> > converted it to Shift_JIS.
> >
> > Your memory is largely correct (or we've changed the code base a
> > bit). The TextDetector makes a decision in favor of {{text/plain}} vs
> > {{application/octet-stream}} via TextStatistics
> > (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
> > if the bytes are:
> >
> > a) mostly in the ascii range (between 0x20 and 128) and don't have too
> > many control characters
> > b) kind of look like UTF-8
> >
> > In the example file I used, there were 0 control, 36 ascii (between 0x20
> > and 128) and 0 safe terms, but the total character count was 218. The
> > isAscii() requires that > 90% of the characters appear between 0x20 and
> > 128...so the text detector failed.
> >
> > In short, this is an area for improvement. I suspect our current
> > mechanism would also be pretty awful on UTF-16.
> >
> > On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler <kkrugler_li...@transpac.com>
> > wrote:
> >>
> >> Hi devs,
> >>
> >> I’m trying to remember the history of how Tika’s current mime-type
> >> detection has evolved, regarding handling of plain text files.
> >>
> >> Currently if I run a Shift-JIS encoded file through Tika (suffix is
> >> “.env”) it gets returned as application/octet-stream.
> >>
> >> I thought that previously we had something which would check if the file
> >> only had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars
> >> besides these), and a reasonable number of line ending chars, and if so
> >> then we’d return text/plain instead of application/octet-stream.
> >>
> >> Thanks,
> >>
> >> -- Ken
> >>
> >> --------------------------
> >> Ken Krugler
> >> +1 530-210-6378
> >> http://www.scaleunlimited.com
> >> Custom big data solutions & training
> >> Flink, Solr, Hadoop, Cascading & Cassandra
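
For anyone following along, here is a rough Java sketch of the two heuristics discussed in this thread: the "more than 90% of bytes between 0x20 and 128" style check, and the "few unexpected control chars plus a reasonable number of line endings" check. This is only an illustration and not the code in TextStatistics. The class name, the method names, and the thresholds (a 2% control-char allowance and one line ending per 1000 bytes) are all assumptions made up for this example.

```java
// Hypothetical sketch, not Tika code: names and thresholds are assumptions.
final class TextHeuristicSketch {

    /** Fraction of bytes in the range 0x20 (inclusive) to 128 (exclusive). */
    static double asciiRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int ascii = 0;
        for (byte b : data) {
            int v = b & 0xFF; // Java bytes are signed; work with 0..255
            if (v >= 0x20 && v < 128) {
                ascii++;
            }
        }
        return (double) ascii / data.length;
    }

    /** "isAscii()"-style check: more than 90% of bytes in the ASCII range. */
    static boolean mostlyPrintableAscii(byte[] data) {
        return asciiRatio(data) > 0.90;
    }

    /**
     * "isText()"-style check: (a) few control bytes other than tab/LF/CR,
     * and (b) at least one line-ending byte per maxBytesPerLineEnding bytes.
     * A CRLF pair counts as two line-ending bytes in this simple version.
     */
    static boolean looksLikeText(byte[] data, double maxControlRatio,
                                 int maxBytesPerLineEnding) {
        if (data.length == 0) {
            return false;
        }
        int unexpectedControl = 0;
        int lineEndings = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v == '\n' || v == '\r') {
                lineEndings++;
            } else if (v < 0x20 && v != '\t') {
                unexpectedControl++;
            }
        }
        boolean fewControls = unexpectedControl <= maxControlRatio * data.length;
        boolean enoughLineEndings =
                (long) lineEndings * maxBytesPerLineEnding >= data.length;
        return fewControls && enoughLineEndings;
    }
}
```

With the counts from the example file above (36 bytes in the ASCII range out of 218), asciiRatio comes out to about 0.165, so a 90% check rejects it. The second check ignores bytes of 0x80 and above entirely, so a multi-byte encoding that stays out of the 0x00-0x1F range can pass as long as the file contains line endings. It would likely still reject UTF-16, since UTF-16 encodes ASCII-range characters with a 0x00 byte.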