Hi Luke,

-----Original Message-----
From: Luke <[email protected]>
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann <[email protected]>, Chris Mattmann
<[email protected]>, "[email protected]"
<[email protected]>
Cc: NSF Polar CyberInfrastructure DR Students
<[email protected]>
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for
the class MIMETypes

>Hi Professor and all,
>
>A Bayesian or machine learning Detector is different from the Bayesian
>selection mechanism reported in TIKA-1517.
>It would make sense to implement a machine learning algorithm in a
>separate Detector class. I have not gone far with this design yet, as I
>am still at the research stage, collecting data. Once I have enough data
>to form a model and can prove the concept, I will get to the machine
>learning detector implementation and its design considerations. (BTW, I
>have some ideas for data collection and training, but it will still take
>some time to come up with even a quick-and-dirty proof of concept. I am
>still working on the data collection, and there are also some design
>problems within the learning techniques; I will come to them once I have
>a clear idea of the data. I think I may have to crawl the data and label
>it for training, and there are some preprocessing steps to take care of
>too.)

+1.

>
>However, my current implementation in TIKA-1517 is based solely on mime
>type "selection" (I cannot find a clearer name distinguishable from
>detection) with probabilities, which might have nothing to do with a
>genuine machine learning detector; it is a feature for adding weights to
>each Tika mime type detection algorithm.

Gotcha.
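
To make the "weights per detection algorithm" idea concrete, here is a
minimal sketch. It is not the TIKA-1517 code and not a full Bayesian
computation, only a plain weighted vote. The class WeightedMimeSelector,
its methods, and the method names "magic", "extension" and "metadata" are
all hypothetical and chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pick a MIME type by summing per-method weights. */
final class WeightedMimeSelector {
    // Detection method name (e.g. "magic") -> how much its vote counts.
    private final Map<String, Double> weights = new LinkedHashMap<>();

    /** Registers a positive weight for one detection method. */
    WeightedMimeSelector weight(String method, double value) {
        if (value <= 0.0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        weights.put(method, value);
        return this;
    }

    /**
     * votes maps a method name to the MIME type that method proposed.
     * Returns the type with the highest total weight, or the fallback
     * when no registered method voted.
     */
    String select(Map<String, String> votes, String fallback) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> vote : votes.entrySet()) {
            Double w = weights.get(vote.getKey());
            if (w != null) { // ignore methods without a registered weight
                scores.merge(vote.getValue(), w, Double::sum);
            }
        }
        String best = fallback;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

With weights of 0.9 for "magic", 0.5 for "extension" and 0.6 for
"metadata", two agreeing weak methods (0.5 + 0.6) outvote one stronger
method (0.9). A real Bayesian selection would combine probabilities
rather than add weights, so treat this only as an illustration of the
shape of the feature.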

> 
>
>But I think you are right: in the future we will need to assign
>weights to a pool of detection algorithms, including machine learning
>techniques and content-based detection algorithms. The current
>declaration of MimeTypes as final has its design purpose, and I
>don’t think it is a good idea to lump detector code into MimeTypes,
>but I will come back to this design or architecture problem once I have
>a clear idea of the machine learning model (not necessarily a Bayesian
>model for detection).
> 
>
>BTW, off the top of my head, I would tend to separate the detector
>semantics from MimeTypes, as mentioned below.
>What do you think about creating, say, a TikaDetector class independent
>of MimeTypes, and removing MimeTypes from the set of
>detectors (i.e. getting rid of the "implements Detector" in
>MimeTypes)?

Yes, can you explore doing this?
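
As a starting point for that exploration, here is a rough sketch of the
separation. Everything in it is an assumption made for illustration:
SimpleDetector and MagicRegistry are simplified stand-ins, not Tika's
real Detector or MimeTypes, and the TikaDetector name is only the one
proposed above. The registry holds type data, while detection lives in a
separate class that consults it and restores the stream position.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for the detector role; NOT Tika's real interface. */
interface SimpleDetector {
    String detect(InputStream in);
}

/** Stand-in for MimeTypes reduced to a registry: holds magics, does not detect. */
final class MagicRegistry {
    private final Map<String, byte[]> magics = new LinkedHashMap<>();

    MagicRegistry register(String type, byte[] magic) {
        if (magic.length == 0) {
            throw new IllegalArgumentException("magic must not be empty");
        }
        magics.put(type, magic.clone());
        return this;
    }

    Map<String, byte[]> magics() {
        return magics;
    }

    int longestMagic() {
        int max = 0;
        for (byte[] m : magics.values()) {
            max = Math.max(max, m.length);
        }
        return max;
    }
}

/** Hypothetical TikaDetector: detection logic that only consults the registry. */
final class TikaDetector implements SimpleDetector {
    static final String UNKNOWN = "application/octet-stream";
    private final MagicRegistry registry;

    TikaDetector(MagicRegistry registry) {
        this.registry = registry;
    }

    @Override
    public String detect(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        int limit = registry.longestMagic();
        byte[] head = new byte[limit];
        int n = 0;
        try {
            in.mark(limit);
            try {
                // Read up to 'limit' bytes; a single read may return fewer.
                while (n < limit) {
                    int r = in.read(head, n, limit - n);
                    if (r < 0) {
                        break;
                    }
                    n += r;
                }
            } finally {
                in.reset(); // leave the stream where the caller had it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // First registered magic that matches the start of the stream wins.
        for (Map.Entry<String, byte[]> e : registry.magics().entrySet()) {
            byte[] magic = e.getValue();
            if (n >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return e.getKey();
            }
        }
        return UNKNOWN;
    }
}
```

For example, registering "application/pdf" with the magic bytes "%PDF-"
and calling detect on a ByteArrayInputStream that starts with "%PDF-1.4"
returns "application/pdf", and the stream can still be read from its
first byte afterwards. Whether the real MimeTypes can be split this way
without breaking its InputStream semantics is the open question.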

>
>I will continue to think about this design problem as we move along, and
>I will leave notes on the ticket for sure. It looks like an important,
>big change, so any suggestions will be welcome and appreciated.

Thank you Luke, will do. I will read more and comment on it. Thanks for
sharing this with the list!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




>
>-----Original Message-----
>From: Christian Alan Mattmann [mailto:[email protected]]
>Sent: Wednesday, January 28, 2015 6:30 PM
>To: Luke; 'Mattmann, Chris A (3980)'
>Cc: [email protected]
>Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for
>the class MIMETypes
>
>Hi Luke, thanks much. I think we should be having this discussion on the
>[email protected] list too, but thanks also for CC’ing the Polar
>students list.
>
>My feeling is that Tyler has a good point and that having a
>BayesianDetector makes a ton of sense. How about we try that as a start,
>and see where it goes?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department University of
>Southern California Los Angeles, CA 90089 USA
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: Luke <[email protected]>
>Date: Wednesday, January 28, 2015 at 5:48 PM
>To: Chris Mattmann <[email protected]>
>Cc: Chris Mattmann <[email protected]>, NSF Polar CyberInfrastructure DR
>Students <[email protected]>
>Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification for
>the class MIMETypes
>
>>Hi Professor,
>>
>>I was about to modify the code to work with inheritance and code reuse
>>when Tyler posted the suggestion below, which is a bit enlightening.
>>
>>Declaring the class final in this case seems to tell me that any input
>>stream passed to the class is attached to one fixed MimeTypes (I tend
>>to think a MimeTypes should be tied to one input stream), or it can be
>>interpreted as the MimeTypes of an input stream.
>>If we inherit from it with my implementation,
>>MimeTypesBaysianSelection, that will look odd as inheritance, since my
>>Bayesian implementation is more like an operation attached to that
>>input stream's MimeTypes.
>>
>>It seems the MimeTypes class is not only used as a mime type detector
>>(although it implements the Detector interface) but also has other
>>purposes, e.g. users can take a peek at the input stream's mime types,
>>extensions, magics, etc. That is probably why it is called MimeTypes
>>rather than something like Detector. I think it is not really a
>>detector, but some of its methods, such as getMagics or similar, make
>>it easy to fit into the Detector slot: it is simpler to outfit it with
>>the Detector interface and use it as one of the detectors. I was
>>initially confused about why its name does not contain "detector", and
>>now I am getting the idea :) but if you have any thoughts, please
>>kindly let me know.
>>
>>It looks like a clearer OO design to me would be a detector class (say
>>TikaDetector) that takes the MimeTypes of an input stream as an
>>argument and executes the detect method with it, although the current
>>detect method takes only an input stream as one of its arguments... we
>>could create a MimeTypes instance inside this detect method. However,
>>this is a premature thought, and if we make this change, I am afraid it
>>is quite possible we would violate some of the original mime design and
>>break some of the semantics, although I do feel the current design has
>>a few small flaws in this respect.
>>
>>On the other hand, if we stick to the original implementation by
>>attaching the Bayesian selection function to MimeTypes, after digging a
>>bit I personally think this is a bit clearer than inheritance (getting
>>rid of the final). It probably also minimizes the code change and its
>>potential impact. Every time I make a code change, I fear there will be
>>a 'butterfly effect'; thorough testing would be needed for sure, which
>>takes time and is quite tedious, but is quite important.
>>
>>Anyway, if you have any advice, ideas, or thoughts, please kindly let
>>me know; they will be welcome and appreciated as usual.
>>
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Tyler Palsulich (JIRA) [mailto:[email protected]]
>>Sent: Wednesday, January 28, 2015 3:52 PM
>>To: [email protected]
>>Subject: [jira] [Commented] (TIKA-1535) Inheritance modification for
>>the class MIMETypes
>>
>>
>>    [ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296084#comment-14296084 ]
>>
>>Tyler Palsulich commented on TIKA-1535:
>>---------------------------------------
>>
>>Maybe someone else can comment on this too. But, I believe MimeTypes is
>>{{final}} because there are restrictions of what can be done with the
>>given {{InputStream}}. If those restrictions are broken, key features
>>of Tika may break. So, we declared the class as final to ensure no one
>>could break those semantics.
>>
>>But, as seen here, it's difficult to predict whether or not there will
>>be a valid {{extend}} use case. So, you have to be careful when marking
>>a class {{final}}.
>>
>>> Inheritance modification for the class MIMETypes
>>> ------------------------------------------------
>>>
>>>                 Key: TIKA-1535
>>>                 URL: https://issues.apache.org/jira/browse/TIKA-1535
>>>             Project: Tika
>>>          Issue Type: Improvement
>>>          Components: mime
>>>            Reporter: Luke sh
>>>            Priority: Trivial
>>>
>>> The class MIMETypes does not currently allow for inheritance.
>>> A couple of methods in this class look independent, and some of
>>>them need to be exposed or overridden for special needs or use cases;
>>>this would give Tika users more flexibility for new mime detection
>>>algorithms.
>>
>>
>>
>>--
>>This message was sent by Atlassian JIRA
>>(v6.3.4#6332)
>>
>
>
