Hi Professor and all,

A Bayesian or machine learning Detector is different from the Bayesian selection mechanism reported in TIKA-1517. It would make sense to implement a machine learning algorithm in a separate Detector class. I have not gone far with this design yet, as I am still at the research stage with data collection. Once I have enough data to form a model and can prove the concept, I will come back to the machine learning detector implementation and its design considerations.

(BTW, I have some ideas for data collection and training, but it will still take some time to come up with something, even quick and dirty, that proves the concept with machine learning. I am still working on the data collection, and there are some design problems within the learning techniques too; I will come to them once I have a clear idea of the data. I think I may have to crawl the data and label it for training, and there are certain preprocessing steps to take care of as well.)
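To make the idea of a separate Detector class a bit more concrete, here is a rough Java sketch. It is only an illustration under stated assumptions: SimpleDetector is a simplified stand-in for Tika's Detector interface (it is not the real signature), and MimeModel, ToyModel, ModelBackedDetector and DetectorSketchDemo are hypothetical names. The toy model is a hard-coded placeholder for a trained model, not a learning algorithm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Assumption: simplified stand-in for Tika's Detector interface. */
interface SimpleDetector {
    String detect(InputStream input) throws IOException;
}

/** Hypothetical model: maps a byte prefix to a score per MIME type. */
interface MimeModel {
    Map<String, Double> score(byte[] prefix);
}

/** Placeholder for a trained model; the rules are hard-coded for the demo. */
final class ToyModel implements MimeModel {
    @Override
    public Map<String, Double> score(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Map<String, Double> scores = new LinkedHashMap<>();
        if (head.startsWith("%PDF")) {
            scores.put("application/pdf", 0.9);
        } else if (head.startsWith("<?xml")) {
            scores.put("application/xml", 0.8);
        }
        return scores;
    }
}

/** Detector that lives outside MimeTypes and delegates to a model. */
final class ModelBackedDetector implements SimpleDetector {
    private static final String FALLBACK = "application/octet-stream";
    private final MimeModel model;
    private final int prefixLength;

    ModelBackedDetector(MimeModel model, int prefixLength) {
        this.model = model;
        this.prefixLength = prefixLength;
    }

    @Override
    public String detect(InputStream input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // Read at most prefixLength bytes, then reset so the caller can reuse the stream.
        input.mark(prefixLength);
        byte[] buffer = new byte[prefixLength];
        int total = 0;
        try {
            int n;
            while (total < prefixLength
                    && (n = input.read(buffer, total, prefixLength - total)) != -1) {
                total += n;
            }
        } finally {
            input.reset();
        }
        // Pick the highest-scoring type; fall back when the model returns nothing.
        String best = FALLBACK;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : model.score(Arrays.copyOf(buffer, total)).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}

class DetectorSketchDemo {
    public static void main(String[] args) throws IOException {
        SimpleDetector detector = new ModelBackedDetector(new ToyModel(), 8);
        byte[] pdf = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detector.detect(new ByteArrayInputStream(pdf)));
    }
}
```

The point of the sketch is only the separation: the model can be swapped or retrained without touching the detector, and neither of them needs to extend or modify MimeTypes.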
However, my current implementation in TIKA-1517 is based solely on MIME type "selection" with probability (I cannot find a clearer name that is distinguishable from detection), which might have nothing to do with a genuine machine learning detector. It is a feature for adding weights to each Tika MIME type detection algorithm. But I think you are right: in the future we will need it to assign weights to a pool of detection algorithms, including machine learning techniques or content-based detection algorithms. The current implementation of MimeTypes as final has its design purpose, and I don’t think it is a good idea to lump detector code into MimeTypes, but I will come back to this design or architecture problem once I have a clearer idea of the machine learning model (not necessarily a Bayesian model for detection).

BTW, off the top of my head, I would tend to distill the detector semantics out of MimeTypes, as mentioned below. What do you think about creating, say, a TikaDetector class independent of MimeTypes, and removing MimeTypes from the detectors (i.e. getting rid of the "implements Detector" in MimeTypes)? I will continue to think about this design problem as we move along, and I will leave notes on the ticket for sure. It looks like an important or big change, so any suggestions will be welcomed and appreciated.

Thanks
Luke

-----Original Message-----
From: Christian Alan Mattmann [mailto:[email protected]]
Sent: Wednesday, January 28, 2015 6:30 PM
To: Luke; 'Mattmann, Chris A (3980)'
Cc: [email protected]
Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

Hi Luke,

thanks much. I think we should be having this discussion on the [email protected] list too, but thanks also for CC’ing the Polar students list. My feeling is that Tyler has a good point and that having a BayesianDetector makes a ton of sense. How about we try that as a start, and see where it goes?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Luke <[email protected]>
Date: Wednesday, January 28, 2015 at 5:48 PM
To: Chris Mattmann <[email protected]>
Cc: Chris Mattmann <[email protected]>, NSF Polar CyberInfrastructure DR Students <[email protected]>
Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

>Hi Professor,
>
>I was about to modify the code to work with inheritance and code
>reuse when Tyler came across the issue and posted the suggestion
>below, which is a bit enlightening.
>
>Defining the class as final in this case seems to tell me that any
>input stream that gets passed to the class is attached to one fixed
>type of MimeTypes (I tend to think MimeTypes should be tied to one
>input stream), or it can be interpreted as the MimeTypes of an input
>stream. If we inherit from it with my implementation,
>MimeTypesBaysianSelection, that will look weird in terms of
>inheritance, as my Bayesian implementation is more like an operation
>attached to that input stream's MimeTypes.
>
>It seems the MimeTypes class is not only used as a MIME type detector
>(although it implements the Detector interface); it also has some
>other purposes, e.g. users can take a peek at the input stream's MIME
>types, extensions, magics, etc. That is probably why it is called
>MimeTypes rather than something like Detector. I think it is not
>really a detector, but some of its methods, such as getMagics or
>something similar, make it easy to fit into the Detector slot: it is
>easier to just outfit it with the Detector interface and use it as one
>of the detectors. I was initially confused about why it is not called
>something with "detector" in it, and now I am getting the idea :) but
>if you have any thoughts, please kindly let me know.
>
>It looks like a clearer OO design to me would be a detector class (say
>TikaDetector) that takes the MimeTypes of an input stream as an
>argument and executes the detect method with it, although the current
>detect method only takes an input stream as one of its arguments. We
>could create a MimeTypes instance inside this detect method. However,
>this is a premature thought, and if we modify it like this, I am
>afraid it is highly possible we would violate some of the original
>MIME design and potentially break some of the semantics, although I do
>feel the current design has a few small flaws in this respect.
>
>On the other hand, if we stick to the original implementation by
>attaching the Bayesian selection function to MimeTypes, after digging
>a bit I personally think this is a bit clearer than inheritance
>(getting rid of the final). It probably also minimizes the code change
>and the potential impact. Every time I make a code change, I fear
>there will be a 'butterfly effect'; thorough testing would be needed
>for sure, which takes some time and is quite tedious, but it is quite
>important.
>
>Anyway, if you have any advice, ideas, or thoughts, please kindly let
>me know; they will be welcomed and appreciated as usual.
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Tyler Palsulich (JIRA) [mailto:[email protected]]
>Sent: Wednesday, January 28, 2015 3:52 PM
>To: [email protected]
>Subject: [jira] [Commented] (TIKA-1535) Inheritance modification for
>the class MIMETypes
>
>
>    [ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296084#comment-14296084 ]
>
>Tyler Palsulich commented on TIKA-1535:
>---------------------------------------
>
>Maybe someone else can comment on this too. But, I believe MimeTypes is
>{{final}} because there are restrictions on what can be done with the
>given {{InputStream}}. If those restrictions are broken, key features
>of Tika may break. So, we declared the class as final to ensure no one
>could break those semantics.
>
>But, as seen here, it's difficult to predict whether or not there will
>be a valid {{extend}} use case. So, you have to be careful when marking
>a class {{final}}.
>
>> Inheritance modification for the class MIMETypes
>> ------------------------------------------------
>>
>>                 Key: TIKA-1535
>>                 URL: https://issues.apache.org/jira/browse/TIKA-1535
>>             Project: Tika
>>          Issue Type: Improvement
>>          Components: mime
>>            Reporter: Luke sh
>>            Priority: Trivial
>>
>> The class MIMETypes does not currently allow for inheritance.
>> There are a couple of methods in this class which look independent,
>> and some of which need to be exposed or overridden for special needs
>> or use cases. This will give Tika users more flexibility for new
>> MIME detection algorithms.
