Hello

Ken Krugler at "Tue, 15 Jun 2010 11:56:51 -0700" wrote:
 KK> I think this is a reasonable approach, as long as (per Alex's suggestion) it's
 KK> configurable in various ways.

 KK> E.g. if you know you don't want to parse OLE2-based files, so you've removed jars for
 KK> those parsers, then it would be great to have an easy way of disabling the (more
 KK> expensive) mime-type detection, and potentially avoid the dependency on these same jars.

 KK> Separately, I think this issue might also trigger improvements to the existing "magic
 KK> bytes" detection code in Tika. IIRC, we wound up adding full regex with some additional
 KK> matching rules in Krugle, to extend the (from Nutch, same as Tika) mime-type detection
 KK> code to better handle things like source code files. I imagine something similar might
 KK> be needed to reliably handle container matching.

I'm not sure that Tika needs full regex support. In my experience in this field, most
mime-type detection tasks need only a search function and a dynamic addressing function:
for example, find the Zip signature somewhere, and then use a mix of getByte(offset)
calls to check other values.

For source code it's better to use something like naive Bayes. It works well (as I
remember from tests that we ran 6 years ago)...

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://alexott.net/ http://alexott-ru.blogspot.com/
Skype: alex.ott
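
The "search function plus getByte(offset)" idea from the reply above could look roughly like the following. This is only a sketch and not Tika's code: the class SignatureProbe and its methods are hypothetical names made up for illustration. It finds the Zip local file header signature anywhere in a buffer and then reads fields at fixed offsets from that position (file name length at +26, file name at +30, per the Zip local header layout).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
final class SignatureProbe {
    // Zip local file header signature: 'P' 'K' 0x03 0x04
    private static final byte[] ZIP_LOCAL_HEADER = {0x50, 0x4B, 0x03, 0x04};

    /** Search function: index of the first occurrence of pattern in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i + pattern.length <= data.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Addressing function: unsigned byte at offset, or -1 when out of range. */
    static int getByte(byte[] data, int offset) {
        return (offset >= 0 && offset < data.length) ? (data[offset] & 0xFF) : -1;
    }

    /** Name of the first Zip entry, or null if no complete local header is found. */
    static String firstZipEntryName(byte[] data) {
        int base = indexOf(data, ZIP_LOCAL_HEADER);
        if (base < 0) {
            return null;
        }
        // File name length is a 2-byte little-endian value at offset 26 of the header.
        int lo = getByte(data, base + 26);
        int hi = getByte(data, base + 27);
        if (lo < 0 || hi < 0) {
            return null;
        }
        int nameLen = lo | (hi << 8);
        int nameStart = base + 30; // the file name follows the 30-byte fixed header
        if (nameStart + nameLen > data.length) {
            return null;
        }
        return new String(data, nameStart, nameLen, StandardCharsets.UTF_8);
    }
}
```

A caller could then compare the returned name against known container markers, for instance the "mimetype" entry that OpenDocument and EPUB packages store first, without any regex machinery.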
