Hi Maxim,

On Fri, 2014-05-02 at 12:41 +0300, Maxim Monastirsky wrote:
> On Thursday 01 May 2014 09:29:48 Kohei Yoshida wrote:
> > So, I looked over those changes, and I do like the changes. :-)
> Thanks Kohei!
>
> > He was concerned about having to "detect" zip
> > storage over and over again which he rightly said was not great for
> > performance.
>
> It makes me think of another point. There are some detectors that do exactly
> the same detection procedure for all supported types. For example - oox, xml,
> and now the new storage one. If such detector didn't detect anything useful
> once, we can be sure that it won't detect anything also in the next runs. So
> it doesn't make sense to run it again and again.

I agree.  I think it makes sense to leave some data such as

* this is (not) a zip storage.
* this is (not) a valid ooxml format.
* this is (not) a valid ODF format.
* this is (not) a valid BIFF storage.

etc., and I can imagine storing these pieces of information with the
MediaDescriptor instance to help the subsequent detectors skip
redundant detection routines.

Actually, maybe we could just specify the type of detected storage, such
as

"DetectedStorage"
  + not detected -> detector should try to detect and store the result.
  + zip
  + gzip
  + biff
  + etc

"DetectedXMLType"
  + not detected -> detector should try to detect the XML type and store
    the result.
  + ODF
  + OOXML

so that we can store all this information using just one slot of the
MediaDescriptor rather than storing multiple boolean values.

Having said that, I don't think we have to go to the extent of "hey,
this is definitely not XYZ format, don't bother trying to detect it".
The idea itself may make sense, but the way the detection services are
currently set up would make it a bit challenging to implement such
additional checks.  And since the number of file formats to detect
against is quite small (~120), simply iterating over all of them should
not cause a performance issue once we put the above mechanism in place
to avoid redundant checks.

> Maybe we can store a list of such detectors in some config file, and add a
> corresponding check to the detection loop. This also would be a bit cleaner
> solution for fdo#46310. What is the best place to store such list?

We already have a list of detectors, and they are sorted in order of
complexity for strategic reasons.
filter/source/config/cache/typedetection.cxx is the place where the list
is stored and maintained.  But as I said above, I'd like us to try the
above mechanism first and see if that will improve the situation a bit.

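To make the single-slot idea a bit more concrete, here is a rough
sketch of how it might look.  This is not LibreOffice code: Descriptor
is a plain std::map standing in for MediaDescriptor, and probeStorage,
detectedStorage, zipBasedDetector, gzipBasedDetector and the "unknown"
value are all hypothetical names made up for illustration.  Only the
slot name "DetectedStorage" comes from the proposal above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for MediaDescriptor: a string-keyed property bag.
using Descriptor = std::map<std::string, std::string>;

// Slot name from the proposal; an absent slot means "not detected yet".
const std::string kStorageSlot = "DetectedStorage";

// Counts how often the expensive probe really runs (illustration only).
static int g_probeCount = 0;

// Hypothetical expensive probe: classifies the stream by its magic bytes.
std::string probeStorage(const std::string& header)
{
    ++g_probeCount;
    if (header.compare(0, 4, "PK\x03\x04") == 0)
        return "zip";
    if (header.compare(0, 2, "\x1f\x8b") == 0)
        return "gzip";
    // Assumption: an unrecognized result is stored too, so nobody re-probes.
    return "unknown";
}

// Returns the cached storage type, probing only if the slot is still empty.
const std::string& detectedStorage(Descriptor& desc, const std::string& header)
{
    auto it = desc.find(kStorageSlot);
    if (it == desc.end())
        it = desc.emplace(kStorageSlot, probeStorage(header)).first;
    return it->second;
}

// Two hypothetical detectors that share the cached result. A real detector
// would go on to inspect the content; these only decide whether to bother.
bool zipBasedDetector(Descriptor& desc, const std::string& header)
{
    return detectedStorage(desc, header) == "zip";
}

bool gzipBasedDetector(Descriptor& desc, const std::string& header)
{
    return detectedStorage(desc, header) == "gzip";
}
```

With something like this, whichever detector runs first pays for the
probe, and every later detector only reads the slot.
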
I'm a bit cautious about either shortening or reordering this master
detector list, since I've seen such changes cause quite hard-to-debug
(and hard-to-fix) format detection bugs in the past.

Best,

Kohei
