Re: Detecting container formats

Alex Ott Tue, 15 Jun 2010 12:46:17 -0700

Hello

Ken Krugler  at "Tue, 15 Jun 2010 11:56:51 -0700" wrote:
 KK> I think this is a reasonable approach, as long as (per Alex's suggestion) it's
 KK> configurable in various ways.


 KK> E.g. if you know you don't want to parse OLE2-based files, so you've removed jars for
 KK> those parsers, then it would be great to have an easy way of disabling the (more
 KK> expensive) mime-type detection, and potentially avoid the dependency on these same jars.

 KK> Separately, I think this issue might also trigger improvements to the existing "magic
 KK> bytes" detection code in Tika. IIRC, we wound up adding full regex with some additional
 KK> matching rules in Krugle, to extend the (from Nutch, same as Tika) mime-type detection
 KK> code to better handle things like source code files. I imagine something similar might
 KK> be needed to reliably handle container matching.

I'm not sure Tika needs full regex support. In my experience in this area,
most mime-type detection tasks need only a search function plus a dynamic
addressing function (for example, find the Zip signature somewhere in the
data, and then use a mix of getByte(offset) calls to check other values).
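
As a rough illustration of the search-plus-addressing idea (a sketch in
Python, not Tika code; the function names and the returned labels are made up
for this example): find the Zip local file header signature, then read fields
at fixed offsets relative to it. The offsets (26 for the name length, 30 for
the name itself) come from the Zip local file header layout. Mapping the first
entry name to a container type is only a heuristic.

```python
import struct

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature of a Zip local file header


def get_u16(data, offset):
    """Read a little-endian unsigned 16-bit value, or None if out of range."""
    if offset < 0 or offset + 2 > len(data):
        return None
    return struct.unpack_from("<H", data, offset)[0]


def first_entry_name(data, search_limit=8192):
    """Search for the Zip signature, then address fields relative to it."""
    start = data.find(ZIP_LOCAL_HEADER, 0, search_limit)
    if start < 0:
        return None
    name_len = get_u16(data, start + 26)  # file name length field
    if name_len is None:
        return None
    name = data[start + 30:start + 30 + name_len]  # name follows the fixed header
    if len(name) != name_len:
        return None  # input was truncated
    return name.decode("utf-8", errors="replace")


def guess_container(data):
    """Map the first entry name to a rough label (hypothetical labels)."""
    name = first_entry_name(data)
    if name is None:
        return None
    if name == "mimetype":
        return "odf-or-epub"
    if name == "[Content_Types].xml":
        return "ooxml"
    if name.startswith("META-INF/"):
        return "jar"
    return "zip"
```

No regex is involved: one substring search and a couple of offset reads are
enough for this kind of check.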

For source code it's better to use something like naive Bayes; it worked
well, as I remember from tests that we made 6 years ago...
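
To make the naive Bayes suggestion concrete, here is a minimal sketch in
Python (not the code from those tests; the class name, tokenizer, and the two
tiny training snippets are assumptions for illustration). It is a multinomial
naive Bayes classifier over source tokens with add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict

# Words, or single punctuation characters; digits are ignored.
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    return TOKEN.findall(text)


class NaiveBayes:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of samples
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data; a real detector would need many samples per language.
clf = NaiveBayes()
clf.train("python", "def main():\n    import sys\n    print(sys.argv)\n")
clf.train("java", "public class Main {\n    public static void main(String[] args) {\n"
                  "        System.out.println(args);\n    }\n}\n")
```

With this toy model, clf.classify("def f(x):\n    return x\n") picks
"python", because tokens such as "def" and ":" only appear in the Python
sample.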

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott
