Hi Luke,

-----Original Message-----
From: Luke <[email protected]>
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann <[email protected]>, Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>
Cc: NSF Polar CyberInfrastructure DR Students <[email protected]>
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

>Hi Professor and all,
>
>A Bayesian or machine learning Detector is different from the Bayesian
>selection mechanism reported in TIKA-1517.
>It would make sense to implement a machine learning algorithm in a
>separate Detector class. I have not gone too far with this design
>thought, as I am still at the research stage with data collection.
>Once I have enough data and am able to form a model, and especially once
>I am able to prove my concept, I will be able to come down to the machine
>learning detector implementation with design considerations. (BTW, I think
>I have some ideas for data collection and training, but it still takes
>some time to come up with something, even quick and dirty, that can prove
>the concept with machine learning. I am still working on the data
>collection, and there are also some design problems within the learning
>techniques; I will come to them once I have a clear idea of the data. I
>think I may have to crawl the data and label it for training, and there
>are certain preprocessing steps to take care of too....)

+1.

>
>However, my current implementation in TIKA-1517 is solely based on mime
>type "selection" (I cannot find any clearer name distinguishable from
>detection) with probability, which might have nothing to do with a
>genuine machine learning detector; it is a feature for adding weights to
>each Tika mime type detection algorithm.

Gotcha.
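
A minimal sketch of what "adding weights to each detection algorithm" could look like, assuming a plain weighted vote. This is not the Bayesian computation from TIKA-1517; the class name, method names ("magic", "extension", "metadata") and weight values are all hypothetical and only illustrate the idea of selection as opposed to detection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedSelectionSketch {

    /**
     * Picks the mime type with the largest accumulated weight.
     *
     * candidates: the type proposed by each detection method, keyed by method name.
     * weights:    how much each method is trusted (hypothetical values).
     */
    public static String select(Map<String, String> candidates,
                                Map<String, Double> weights) {
        // Sum the weights of all methods that voted for the same type.
        Map<String, Double> score = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : candidates.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            score.merge(e.getValue(), w, Double::sum);
        }

        // Fall back to the generic type when nothing scores above zero.
        String best = "application/octet-stream";
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("magic", "application/pdf");
        candidates.put("extension", "text/plain");
        candidates.put("metadata", "text/plain");

        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("magic", 0.7);
        weights.put("extension", 0.2);
        weights.put("metadata", 0.2);

        // magic alone (0.7) outweighs extension + metadata (0.4).
        System.out.println(select(candidates, weights)); // prints "application/pdf"
    }
}
```

The selection step only combines answers that the individual detection methods already produced, which is why it can live apart from any one detector.
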
>
>
>But I think you are right, and in the future we kinda need it to assign
>weights to a pool of detection algorithms, including machine learning
>techniques or content based detection algorithms. The current
>implementation of MimeTypes with final has its design purpose, and I
>don’t think it is a good idea to lump detector code into MimeTypes,
>but I will come down to this design or architecture problem once I have
>some clear ideas of the machine learning model (not necessarily a
>Bayesian model for detection).
>
>
>BTW, off the top of my head, I would tend to distill the detector
>semantics out of MimeTypes, as described below.
>What do you think about creating, say, a TikaDetector class independent
>from MimeTypes, and removing MimeTypes from the
>detectors (i.e. getting rid of the "implements Detector" in
>MimeTypes)?

Yes, can you explore doing this?

>
>I will continue to think about this design problem as we move along, and
>I will leave notes on the ticket for sure. It looks like an important or
>big change, so any suggestions will be welcomed and appreciated.

Thank you Luke, will do. I will read more and comment on it.
Thanks for sharing this with the list!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

>
>-----Original Message-----
>From: Christian Alan Mattmann [mailto:[email protected]]
>Sent: Wednesday, January 28, 2015 6:30 PM
>To: Luke; 'Mattmann, Chris A (3980)'
>Cc: [email protected]
>Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for
>the class MIMETypes
>
>Hi Luke, thanks much. I think we should be having this discussion on the
>[email protected] list too, but thanks also for CC’ing the Polar
>students list.
>
>My feeling is that Tyler has a good point and that having a
>BayesianDetector makes a ton of sense. How about we try that as a start,
>and see where it goes?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department University of
>Southern California Los Angeles, CA 90089 USA
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: Luke <[email protected]>
>Date: Wednesday, January 28, 2015 at 5:48 PM
>To: Chris Mattmann <[email protected]>
>Cc: Chris Mattmann <[email protected]>, NSF Polar CyberInfrastructure DR
>Students <[email protected]>
>Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification for
>the class MIMETypes
>
>>Hi Professor,
>>
>>I was about to modify the code to be able to work with inheritance and
>>code reuse, Tyler in the following just came across and posted a
>>suggestion, which is a bit enlightening.
>>
>>Defining the class with final in this case seems to tell me that any
>>input stream that gets passed to the class is attached to one fixed type
>>of MimeTypes (I tend to think the MimeTypes should be tied to one
>>input stream), or it can be interpreted as the MimeTypes of an input
>>stream.
>>If we inherit from it, as in my implementation of
>>MimeTypesBaysianSelection, that will look weird in the sense of
>>inheritance, as my Bayesian implementation is more like an operation
>>attached to that input stream's MimeTypes.
>>
>>It seems the MimeTypes class is not only used as a mime type detector
>>(it implements the Detector interface though), but also has some other
>>purposes, e.g. users can take a peek at the input stream's mime types,
>>extensions, magics, etc. That is probably why it is called MimeTypes
>>rather than something like Detector. I think it is not a detector, but
>>some of its methods, such as getMagics or something, make it easier to
>>fit into the slot of Detectors, as it is easier to just outfit it with a
>>Detector interface and use it as one of the detectors. I was initially
>>confused why it is not called something with detector in it and now I
>>am getting the idea....:) but if you have any thoughts, please kindly
>>let me know.
>>
>>It looks like a clearer OO design for me would be a detector class (say
>>TikaDetector) that takes the MimeTypes of an input stream as an argument
>>and executes the detect method with the MimeTypes of the input stream,
>>although the current detect method only takes an input stream as one of
>>its arguments.... we can create a MimeTypes instance inside this
>>detect method. However, this is my premature thought, and if we
>>modify it like this, I am afraid it is highly possible we would violate
>>some of the original design of mime, and this could potentially break
>>some of the semantics..., although I do feel the
>>current design has a few little flaws in this respect.
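
The TikaDetector idea above can be sketched as composition rather than inheritance. This is a rough, self-contained illustration only: Detector and MimeRegistry below are simplified stand-ins written for this sketch, not Tika's real Detector interface or MimeTypes class, and the magic prefixes are just two well-known examples. TikaDetector is the name proposed in the thread; its shape here is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompositionSketch {

    /** Hypothetical detector contract (not Tika's actual interface). */
    public interface Detector {
        String detect(InputStream input) throws IOException;
    }

    /** Stand-in for a final registry such as MimeTypes: lookups only, no Detector role. */
    public static final class MimeRegistry {
        public String byMagic(byte[] prefix) {
            String head = new String(prefix, StandardCharsets.US_ASCII);
            if (head.startsWith("%PDF-")) return "application/pdf";
            if (head.startsWith("<?xml")) return "application/xml";
            return "application/octet-stream";
        }
    }

    /** Detector that holds a registry instead of extending it. */
    public static final class TikaDetector implements Detector {
        private final MimeRegistry registry;

        public TikaDetector(MimeRegistry registry) {
            this.registry = registry;
        }

        @Override
        public String detect(InputStream input) throws IOException {
            if (!input.markSupported()) {
                throw new IllegalArgumentException("stream must support mark/reset");
            }
            // Peek at the first bytes, then rewind so the caller sees an untouched stream.
            input.mark(8);
            byte[] buf = new byte[8];
            int n = 0;
            while (n < buf.length) {
                int r = input.read(buf, n, buf.length - n);
                if (r < 0) break;
                n += r;
            }
            input.reset();
            return registry.byMagic(Arrays.copyOf(buf, n));
        }
    }

    public static void main(String[] args) throws IOException {
        Detector detector = new TikaDetector(new MimeRegistry());
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detector.detect(in)); // prints "application/pdf"
    }
}
```

With this shape the registry can stay final, and any stream-handling rules sit in one detector class instead of being exposed to subclasses.
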
>>
>>On the other hand, if we stick to the original implementation by
>>attaching the Bayesian selection function to MimeTypes, after digging a
>>bit I personally think this is a bit clearer than inheritance (getting
>>rid of the final). Probably this also minimizes the code change and
>>potential impact. Every time I make a code change, I always fear there
>>would be a 'butterfly effect'; thorough testing would be needed for
>>sure....which does take some time and is quite tedious.... quite
>>important though....
>>
>>Anyway, if you have any advice/ideas/thoughts, please kindly let me know
>>and they will be welcomed and appreciated as usual.
>>
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Tyler Palsulich (JIRA) [mailto:[email protected]]
>>Sent: Wednesday, January 28, 2015 3:52 PM
>>To: [email protected]
>>Subject: [jira] [Commented] (TIKA-1535) Inheritance modification for
>>the class MIMETypes
>>
>>
>> [ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296084#comment-14296084 ]
>>
>>Tyler Palsulich commented on TIKA-1535:
>>---------------------------------------
>>
>>Maybe someone else can comment on this too. But, I believe MimeTypes is
>>{{final}} because there are restrictions of what can be done with the
>>given {{InputStream}}. If those restrictions are broken, key features
>>of Tika may break. So, we declared the class as final to ensure no one
>>could break those semantics.
>>
>>But, as seen here, it's difficult to predict whether or not there will
>>be a valid {{extend}} use case. So, you have to be careful when marking
>>a class {{final}}.
>>
>>> Inheritance modification for the class MIMETypes
>>> ------------------------------------------------
>>>
>>> Key: TIKA-1535
>>> URL: https://issues.apache.org/jira/browse/TIKA-1535
>>> Project: Tika
>>> Issue Type: Improvement
>>> Components: mime
>>> Reporter: Luke sh
>>> Priority: Trivial
>>>
>>> The Class MIMETypes does not currently allow for inheritance.
>>> There are a couple of methods in this class which looks independent,
>>>and some of which needs to be exposed or overwritten for special needs
>>>or use cases, this will enable tika users with more flexibility for
>>>new mime detection algorithm.
>>
>>
>>
>>--
>>This message was sent by Atlassian JIRA
>>(v6.3.4#6332)
>>
>
>
