Re: [memex-jpl] this week action from luke

Chris Mattmann Thu, 23 Apr 2015 07:22:07 -0700

Great work, Luke; both of these changes make sense.
Please send the pull request for that, thank you!


Great work Giuseppe! Go team!

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]




-----Original Message-----
From: Luke <[email protected]>
Date: Thursday, April 23, 2015 at 3:08 AM
To: 'Luke' <[email protected]>, Chris Mattmann
<[email protected]>, Chris Mattmann
<[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<[email protected]>, <[email protected]>, "'Bryant, Ann C
(398G-Affiliate)'" <[email protected]>, "'Zimdars, Paul A
(3980-Affiliate)'" <[email protected]>, NSF Polar
CyberInfrastructure DR Students <[email protected]>,
<[email protected]>
Subject: RE: [memex-jpl] this week action from luke

>Both patches from Giuseppe work in my tests: I could see the magic tag
>being written at the beginning of the file, and the cbor extension being
>appended as well, when running the Nutch dump tool with the
>"-extension cbor" option. Thanks a lot for the kind help, Giuseppe, highly
>appreciated. I want to give a big thumbs up to Giuseppe's work; it is
>thorough and considerate too.
>
>To professor,
>With Giuseppe's two patches, we still need a small change in Tika's
>tika-mimetypes.xml. (BTW, the cbor magic tag can be used as magic bytes in
>Tika because it does not look very common; even if it accidentally appears
>in some other type of file, Tika has the extension and metadata hint as a
>fallback strategy.) I am going to send another pull request with that
>change, but before that I would like to explain what I am going to change,
>to avoid any confusion.
>
>Now we have two problems.
>Problem 1: magic priority 40.
>       application/xhtml+xml has a higher priority (50) than
>application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So
>if xhtml is read and compared first, cbor will not even be placed in the
>magic estimation list, because it has a lower priority. The tests confirm
>that xhtml is indeed read and compared first against the input file, so
>any type with a priority below 50 is disregarded.
>
>
>Problem 2: magic priority again, now with 50.
>       In Tika, given a file dumped by the Nutch dumper tool, both types
>(xhtml and cbor) are selected as candidate mime types and put in the
>magic estimation list. Since the xhtml type is read first, it is placed
>above cbor; to break that tie, Tika relies on the decision from the
>extension method. If the extension method fails to detect the type (for
>now, let's ignore the metadata hint method for simplicity, but the same
>applies to it too), then xhtml is eventually returned.
>
>My pull request to be sent: I am going to set the magic priority of the
>cbor type to 50, the same as xhtml, because it would probably be risky to
>discard any one of the estimated types without consulting the extension
>method.
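
The two problems and the proposed change can be illustrated with a short sketch. This is only a simplified model of the behaviour reported in the tests above, not Tika's actual code; the function name `select_type` and the candidate tuples are made up for illustration.

```python
def select_type(magic_candidates, extension_type=None):
    """Toy model of the behaviour described above (not Tika's real code).

    magic_candidates: list of (mime_type, priority) in the order read.
    extension_type:   type suggested by the file extension, or None.
    """
    if not magic_candidates:
        return extension_type or "application/octet-stream"
    # Only candidates with the highest priority stay in the estimation list.
    top = max(priority for _, priority in magic_candidates)
    estimated = [mime for mime, priority in magic_candidates if priority == top]
    # The extension breaks the tie if it agrees with one of the estimates.
    if extension_type in estimated:
        return extension_type
    # Otherwise the first type read wins.
    return estimated[0]


xhtml = ("application/xhtml+xml", 50)

# Problem 1: cbor at priority 40 is dropped before the extension is consulted.
print(select_type([xhtml, ("application/cbor", 40)], "application/cbor"))
# Proposed change: with both at 50, the extension can break the tie.
print(select_type([xhtml, ("application/cbor", 50)], "application/cbor"))
# Problem 2: with no usable extension, xhtml is still returned.
print(select_type([xhtml, ("application/cbor", 50)], None))
```

In this model, raising cbor to priority 50 only helps when the extension (or metadata hint) is available, which matches the concern raised in Problem 2.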
>
>Any comments, suggestions, or thoughts are welcome and appreciated.
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Luke [mailto:[email protected]]
>Sent: Wednesday, April 22, 2015 7:45 PM
>To: 'Mattmann, Chris A (3980)'
>Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
>'[email protected]'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>'[email protected]'
>Subject: RE: [memex-jpl] this week action from luke
>
>Hi Prof,
>
>The test has finished and the result is as expected.
>Both runs (Tika with the prob feature and Tika without it) produced the
>same "Stats total"; please see the attached matched.txt, dumped by a small
>program that compares, verbatim, each line in every section of the
>"Stats total" between the log produced by the Tika that has the feature
>and the one without it. If string.equals(...) holds, the line is dumped
>out; if there is a mismatch (e.g. the count for a particular mime type is
>different), an error is dumped out instead. I don't see any error in the
>printout, so I think the feature has passed the test.
>
>
>The processing times of the 2 tests are as follows.
>Start and end time for the test where the Nutch dumper tool used the Tika
>with the prob selection feature:
>from
>2015-04-22 15:47:08,330
>to
>2015-04-22 17:48:28,877
>
>Start and end time for the test where the Nutch dumper tool used the Tika
>without the feature:
>from
>2015-04-22 22:41:23,459
>to
>2015-04-23 00:11:02,767
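
For convenience, the elapsed time of each run can be worked out from the four timestamps above. The snippet below is just a small helper for that arithmetic; the function name `elapsed` is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"  # log timestamp with milliseconds after the comma


def elapsed(start, end):
    """Return the time between two log timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


with_feature = elapsed("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877")
without_feature = elapsed("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767")

print("with prob selection:   ", with_feature)
print("without prob selection:", without_feature)
```

By this calculation the run with the feature took about 2 h 1 min and the run without it about 1 h 30 min. These are single runs, so the difference should be read with care.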
>
>
>BTW, I forgot to mention that the probabilistic mime selector with default
>weight settings also gives the following result, because by default I
>intentionally assign a higher weight to the magic bytes method so as to
>make it work in a way similar to the old strategy. On the other hand, if
>I know that the extension is more reliable, I can certainly add more weight
>to the extension approach; in that case, the prob mime selector will
>return application/cbor with a higher weight value.
>
>> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
>> Result: "text/html"
>> 
>> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
>> Result: "application/xhtml+xml"
>
>
>Please kindly let me know if anything about the tests is unclear.
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>Sent: Wednesday, April 22, 2015 3:49 PM
>To: Luke
>Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
>[email protected]; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>[email protected]
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke, this is probably a good opportunity to test out your Bayesian
>mime detector. How does it perform here?
>
>Sent from my iPhone
>
>> On Apr 22, 2015, at 3:29 PM, Luke <[email protected]> wrote:
>> 
>> Hi professor,
>> 
>> Please see the following results.
>> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
>> Result: "text/html"
>> 
>> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
>> Result: "application/xhtml+xml"
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[email protected]]
>> Sent: Wednesday, April 22, 2015 4:21 AM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>> (3980-Affiliate)'; [email protected]
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>> [email protected]
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Hi Luke,
>> 
>> Actually I just meant go into tika-mimetypes.xml and change the magic
>> offsets for application/xhtml+xml and see if that works. The code you
>> changed below is actually how many bytes Tika will first download to do
>> MIME checking.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> [email protected]
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <[email protected]>
>> Date: Wednesday, April 22, 2015 at 2:25 AM
>> To: Chris Mattmann <[email protected]>, Chris Mattmann
><[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <[email protected]>, <[email protected]>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>,
>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>,
>> NSF Polar CyberInfrastructure DR Students
>> <[email protected]>,
>> <[email protected]>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> 
>>> Hi professor,
>>> 
>>> I just tried it with minLength set to 1024, and I get the following:
>>> "text/plain"
>>> I am a bit surprised....
>>> 
>>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with
>>> anything below 1024 min length, I am seeing "text/plain". :)
>>> 
>>> BTW, the min length I am referring to / altering is the following, in
>>> MimeTypes.java:
>>>     public int getMinLength() {
>>>         // This needs to be reasonably large to be able to correctly
>>>         // detect things like XML root elements after initial comment
>>>         // and DTDs
>>>         return 64 * 1024;
>>>     }
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> -----Original Message-----
>>> From: Chris Mattmann [mailto:[email protected]]
>>> Sent: Tuesday, April 21, 2015 7:48 PM
>>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>>> (3980-Affiliate)'; [email protected]
>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>> [email protected]
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Thanks Luke.
>>> 
>>> So I guess all I was asking was could you try it out. Thanks for the
>>> lesson in the RFC.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ------------------------
>>> Chris Mattmann
>>> [email protected]
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <[email protected]>
>>> Date: Wednesday, April 22, 2015 at 1:46 AM
>>> To: Chris Mattmann <[email protected]>, Chris Mattmann
>>> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>>> <[email protected]>, <[email protected]>
>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>,
>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>,
>>> NSF Polar CyberInfrastructure DR Students
>>> <[email protected]>,
>>> <[email protected]>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi professor,
>>>> 
>>>> 
>>>> I think it highly depends on the content being read by Tika. E.g. if
>>>> a sequence of bytes in the file being read matches the magic of one
>>>> or more of the mime types defined in our tika-mimetypes.xml, I guess
>>>> that Tika will put those types in its estimation list; please note
>>>> that the magic-byte detection approach can produce multiple estimated
>>>> mime types. Tika also considers the decision made by the extension
>>>> detection approach: if the extension says the file type is the first
>>>> one in the magic type estimation list, then certainly the first one
>>>> will be returned (the same applies to the metadata hint approach). Of
>>>> course, Tika also prefers the type that is the most specialized.
>>>> 
>>>> Let's get back to the following question; this is only my guess though.
>>>> [Prof]: Also what happens if you tweak the definition of XHTML to
>>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>>> then?
>>>> Let's consider an extreme case where we only scan 10 bytes or 1 byte:
>>>> then it seems that magic bytes will inevitably detect nothing, and I
>>>> think it will return something like "application/octet-stream",
>>>> which is the most general type. As mentioned, Tika favours the type
>>>> that is the most specialized, and in this extreme case I believe
>>>> almost every type is a subclass of "application/octet-stream", so
>>>> whatever the extension approach returns is more specialized.
>>>> Therefore the answer in this extreme case may be yes: I think it is
>>>> very possible that the CBOR type detected by the extension approach
>>>> takes over in this case...
>>>> 
>>>> My idea was and still is that if the CBOR self-describing tag 55799
>>>> is present in the cbor file, then it can be used to detect the cbor
>>>> type. Again, the cbor type will probably be appended to the magic
>>>> estimation list together with another one such as application/html;
>>>> I guess the order in the list probably also matters, with the first
>>>> one preferred over the next one. The decision from the extension
>>>> detection approach also plays a role in breaking the tie: e.g. if
>>>> the extension detection method returns cbor, agreeing with one of
>>>> the estimated types in the magic list, then cbor will be returned
>>>> (again, the same applies to the metadata hint method).
>>>> 
>>>> I have not taken a closer look at a cbor file that has the tag
>>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the
>>>> tag should be present in the header as a fixed sequence of bytes
>>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>>>> present in the file, or preferably in the header (within a reasonable
>>>> range of bytes), I believe it can probably be used as the magic
>>>> number for the cbor type.
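
A check for the self-describing tag at the start of a file could look like the sketch below. The byte sequence 0xd9d9f7 comes from RFC 7049, section 2.4.5; the function name `starts_with_cbor_tag` and the sample data are made up for illustration, and this is not how Tika implements magic detection.

```python
CBOR_SELF_DESCRIBE_MAGIC = bytes.fromhex("d9d9f7")  # tag 55799 (RFC 7049, 2.4.5)


def starts_with_cbor_tag(data: bytes) -> bool:
    """Return True if the data begins with the self-describing CBOR tag."""
    return data.startswith(CBOR_SELF_DESCRIBE_MAGIC)


# 0xd9 = major type 6 (tag) with a 2-byte tag number; 0xd9f7 = 55799.
assert int.from_bytes(CBOR_SELF_DESCRIBE_MAGIC[1:], "big") == 55799

tagged = CBOR_SELF_DESCRIBE_MAGIC + b"\x63abc"   # tag followed by the text "abc"
untagged = b"\x63abc"                            # same item without the tag
print(starts_with_cbor_tag(tagged), starts_with_cbor_tag(untagged))
```

Checking only offset 0 keeps the test cheap; whether a wider offset range is needed depends on how the encoder writes the tag.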
>>>> 
>>>> 
>>>> There is another thing I mentioned in the JIRA ticket I opened
>>>> yesterday against the cbor parser and detection: it is also possible
>>>> that cbor content is embedded inside a plain JSON file, and the way
>>>> a decoder can distinguish them in that file is again by looking at
>>>> the tag 55799. This may rarely happen, but a robust parser might be
>>>> able to take care of it; Tika might need to consider using the
>>>> FasterXML library used by the Nutch tool when developing the cbor
>>>> parser...
>>>> Again, let me cite the same paragraph from the RFC:
>>>> 
>>>> " a decoder might be able to parse both CBOR and JSON.
>>>>  Such a decoder would need to mechanically distinguish the two
>>>> formats.  An easy way for an encoder to help the decoder would be to
>>>> tag the entire CBOR item with tag 55799, the serialization of which
>>>> will never be found at the beginning of a JSON text."
>>>> 
>>>> 
>>>> Thanks
>>>> Luke
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980)
>>>> [mailto:[email protected]]
>>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students';
>>>> [email protected]
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hi Luke,
>>>> 
>>>> Can you post the below conversation to dev@tika and summarize it
>>>> there?
>>>> Also what happens if you tweak the definition of XHTML to not scan
>>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398) NASA Jet
>>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: [email protected]
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department University
>>>> of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <[email protected]>
>>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U
>>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann
>>>> <[email protected]>
>>>> Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>,
>>>> "Zimdars, Paul A (3980-Affiliate)" <[email protected]>,
>>>> NSF Polar CyberInfrastructure DR Students
>>>> <[email protected]>,
>>>> "[email protected]" <[email protected]>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi Professor,
>>>>> Please see attached jpg for the difference.
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>> [email protected]
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Hey Luke, what happens if you do
>>>>> java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>>>> compared to
>>>>> java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>>>> Any difference?
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke <[email protected]>
>>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>>> To: 'Luke' <[email protected]>, Chris Mattmann
>>>>> <[email protected]>, 'Giuseppe Totaro'
>>>>> <[email protected]>, Chris Mattmann
>>>>> <[email protected]>
>>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>,
>>>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>,
>>>>> NSF Polar CyberInfrastructure DR Students
>>>>> <[email protected]>,
>>>>> <[email protected]>
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Hi professor,
>>>>>> I just sent a pull request adding the cbor extension.
>>>>>> The interesting thing is that Tika still identifies the file
>>>>>> dumped by the Nutch dump tool as "application/xhtml+xml" even
>>>>>> when I manually change the file extension to the correct one
>>>>>> (i.e. *.cbor).
>>>>>> 
>>>>>> The reason is probably that Tika identifies
>>>>>> "application/xhtml+xml" by searching for "&lt;html" in the file
>>>>>> content (PFA: xhtml+xml.jpg). Now if you take a look at the cbor
>>>>>> file dumped by Nutch, you see that we do have that element as part
>>>>>> of the cbor content, because the entire crawled xhtml document
>>>>>> seems to be embedded in the cbor JSON (PFA: cbor.jpg). Also, in
>>>>>> Tika, magic detection seems to have higher priority than glob
>>>>>> detection, thus the type is being incorrectly detected.
>>>>>> 
>>>>>> Therefore, I would like to mention that adding the entry
>>>>>> <glob pattern="*.cbor"/> does not resolve the issue for now,
>>>>>> without some fixed magic bytes / patterns for cbor.
>>>>>> I would also like to add that things will be different with our
>>>>>> probabilistic mime detection selector: if we know that the file
>>>>>> extension is more reliable than magic bytes, then we can certainly
>>>>>> give more preferential weight to the extension... This might also
>>>>>> show that the current MimeTypes detection implementation is a bit
>>>>>> stiff, or less flexible, in this scenario. :)
>>>>>> 
>>>>>> 
>>>>>> Thanks
>>>>>> Luke
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke [mailto:[email protected]]
>>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>>> '[email protected]'
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>> Yes, let me add the cbor extension entry to the Tika XML; I will
>>>>>> send the pull request soon.
>>>>>> 
>>>>>> Thanks
>>>>>> Luke
>>>>>> -----Original Message-----
>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>>>> [email protected]
>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>> 
>>>>>> Giuseppe, both of these ideas, supporting the CBOR WRITE_TYPE_HEADER
>>>>>> and tag along with adding an -extension command, would be fantastic.
>>>>>> Can you file both of those NUTCH issues, wait a day or so, and
>>>>>> then based on feedback use your new Nutch commit karma to get
>>>>>> those into Nutch?
>>>>>> 
>>>>>> And then when creating the issues, can you link to the TIKA-1610
>>>>>> issue? At that point, when those two to-be-defined NUTCH issues are
>>>>>> up, Luke, in parallel can you throw up a pull request/patch in Tika
>>>>>> for the extension along with the MIME detection?
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris
>>>>>> 
>>>>>> ------------------------
>>>>>> Chris Mattmann
>>>>>> [email protected]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Giuseppe Totaro <[email protected]>
>>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>>> To: Chris Mattmann <[email protected]>
>>>>>> Cc: Luke <[email protected]>, Chris Mattmann
>>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR
>>>>>> Students <[email protected]>,
>>>>>> "[email protected]"
>>>>>> <[email protected]>
>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks Luke. Great work.
>>>>>>> Chris, we wrap a single string value, representing the JSON text,
>>>>>>> for each file into CBOR (by using the serializeCBORData method).
>>>>>>> For instance, using the Unix hex dump tool, we can see that, as
>>>>>>> expected, the first byte of all files is "0x7A" (the first three
>>>>>>> bits are "011", that is the major type for text strings, and the
>>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the
>>>>>>> length of the following text), and the following 4 bytes
>>>>>>> (a 32-bit unsigned integer) encode the right length of the text
>>>>>>> (as described in RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>>>> Therefore, a CBOR tag is currently not included in the file (a
>>>>>>> list of cbor tags is available here
>>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>>> I did not know about the CBOR "magic header". Thanks a lot Luke
>>>>>>> for this great research. Chris, if you agree, I can add support
>>>>>>> for prepending the self-describing CBOR tag 55799 to the
>>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I
>>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the
>>>>>>> CBORGenerator class (the source code is available here
>>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>>> Then, I can comment on the TIKA-1610
>>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
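
The bit layout described above (three bits of major type followed by five bits of additional information, per RFC 7049) can be decoded with a couple of bit operations. This is a minimal sketch; the helper name `split_initial_byte` and the `MAJOR_TYPES` table are made up for illustration.

```python
MAJOR_TYPES = {
    0: "unsigned integer", 1: "negative integer", 2: "byte string",
    3: "text string", 4: "array", 5: "map", 6: "tag", 7: "simple/float",
}


def split_initial_byte(byte: int):
    """Split a CBOR initial byte into (major type, additional information)."""
    return byte >> 5, byte & 0x1F


# The bit pattern quoted above: "011" followed by "11010".
major, info = split_initial_byte(0b011_11010)
# Additional information 26 means a 4-byte length follows.
print(MAJOR_TYPES[major], info)

# For comparison, the first byte of the self-describing tag 55799 (0xd9d9f7).
print(split_initial_byte(0xD9))
```

A file that starts with a text-string byte rather than the tag byte 0xd9 gives a magic-based detector nothing CBOR-specific to match on, which is why prepending the tag helps.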
>>>>>>> 
>>>>>>> Regarding the file extension, in the Memex CCA format the
>>>>>>> original file extension is used. We could add support for an
>>>>>>> -extension command-line option allowing the user to specify a
>>>>>>> file extension (e.g., cbor) for all files dumped out.
>>>>>>> 
>>>>>>> Thanks a lot,
>>>>>>> Giuseppe
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> Thanks for this great research, Luke!
>>>>>>> 
>>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>>> 
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Chief Architect
>>>>>>> Instrument Software and Science Data Systems Section (398) NASA
>>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>> Email: [email protected]
>>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <[email protected]>
>>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe
>>>>>>> U (3980-Affiliate)" <[email protected]>, Chris Mattmann
>>>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR
>>>>>>> Students <[email protected]>,
>>>>>>> "[email protected]"
>>>>>>> <[email protected]>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks professor.
>>>>>>>> Hi professor and all.
>>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>>> 
>>>>>>>> I did a bit of research on cbor detection.
>>>>>>>> 
>>>>>>>> It looks like there is a self-describing tag that needs to be
>>>>>>>> written into the cbor file, through which other applications
>>>>>>>> might be able to identify the cbor type....
>>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>>> 
>>>>>>>> I don't see that tag in the cbor file dumped by the Nutch tool,
>>>>>>>> though I am not very sure.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF
>>>>>>>> Polar CyberInfrastructure DR Students';
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> Nice one, Luke. If you have a second and you can open up an
>>>>>>>> issue in Tika to make it support CBOR, then yes, by all means!
>>>>>>>> :)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> [email protected]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke <[email protected]>
>>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>>> To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann
>>>>>>>> <[email protected]>, Chris Mattmann
>>>>>>>> <[email protected]>, "'Bryant, Ann C
>>>>>>>>(398G-Affiliate)'"
>>>>>>>> <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR
>>>>>>>> Students <[email protected]>,
>>>>>>>> <[email protected]>
>>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>>> Thanks a lot, Giuseppe, for the prompt response clearing up a
>>>>>>>>> bit of my confusion about the Nutch CommonCrawlDataDumper;
>>>>>>>>> appreciated.
>>>>>>>>> 
>>>>>>>>> BTW, it looks like Tika might need to consider supporting a
>>>>>>>>> CBOR parser and detection.
>>>>>>>>> I checked the RFC; it looks like CBOR has no magic numbers.
>>>>>>>>> PFA:
>>>>>>>>> rfc_cbor.jpg
>>>>>>>>> Actually, I don't quite understand why the
>>>>>>>>> CommonCrawlDataDumper is not dumping the Nutch segments with
>>>>>>>>> the .cbor extension, which would be helpful for type
>>>>>>>>> detection.
>>>>>>>>> 
>>>>>>>>> To professor Mattmann,
>>>>>>>>> Tika does not support the detection of CBOR: although the trunk
>>>>>>>>> version has entries for cbor in tika-mimetypes.xml (PFA:
>>>>>>>>> cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>>>> the cbor files dumped by CommonCrawlDataDumper. Also,
>>>>>>>>> CBOR does not have magic bytes; off the top of my head, the only
>>>>>>>>> way we can detect it is by using the extension and a content
>>>>>>>>> byte histogram (please note, this is a locally optimal solution
>>>>>>>>> and data-dependent.) :)
>>>>>>>>> 
>>>>>>>>> I think I am deviating a bit from the main route and discussion
>>>>>>>>> of this thread, i.e. the plan for testing the "probabilistic
>>>>>>>>> mime detector selection" with polar data.
>>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic
>>>>>>>>> selection feature, replace the Tika jar in Nutch with the
>>>>>>>>> repackaged one, and then run the CommonCrawlDataDumper and see
>>>>>>>>> how it goes. If you have any specific ideas or thoughts on the
>>>>>>>>> testing, please kindly let me know.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>>> To: Luke liu
>>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C
>>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF
>>>>>>>>> Polar CyberInfrastructure DR Students;
>>>>>>>>> [email protected]
>>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Luke,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> my name is Giuseppe and I am a PhD student working under the
>>>>>>>>> supervision of Prof. Chris Mattmann. I worked on
>>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 19 Apr 2015, at 12:11, Luke liu
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks a lot professor; sorry for the brief delay, I was
>>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>> 
>>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is
>>>>>>>>> dumping the crawl segments to JSON files with human-readable
>>>>>>>>> content.
>>>>>>>>> 1) I am trying to run one of the commands shown in
>>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end
>>>>>>>>> with .html or .htm. The commands listed in gen-common-crawl.sh
>>>>>>>>> seem to allude to where the data is located on our
>>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>;
>>>>>>>>> although the locations are not exactly correct (probably they
>>>>>>>>> need to be updated), part of the patterns allowed me to locate
>>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015).
>>>>>>>>> Again I am seeing that the dumped files all end with html, but
>>>>>>>>> surprisingly, inside those output html files, the contents are
>>>>>>>>> in JSON format;
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The file extension is (almost) always the same as that of the
>>>>>>>>> original file. In more detail, using the -epochFilename
>>>>>>>>> command-line option (as in gen-common-crawl.sh), the scraped
>>>>>>>>> data will be stored with a filename of the format
>>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>>> where <filetype> is either the extension of the original file
>>>>>>>>> or .html as default if the original file does not have an
>>>>>>>>> extension. This scheme is used for file naming and it does not
>>>>>>>>> depend on the internal output format (JSON).
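
The naming scheme described above can be sketched in a few lines. This is only an illustration of the rule as stated, not the Nutch implementation; the function name `dump_filename` and the sample names are made up.

```python
import os


def dump_filename(epoch_ms: int, original_name: str) -> str:
    """Build '<epochtime(ms)>.<filetype>' as described above (illustration only)."""
    ext = os.path.splitext(original_name)[1].lstrip(".")
    # Fall back to html when the original file has no extension.
    return f"{epoch_ms}.{ext or 'html'}"


print(dump_filename(1423894754000, "index"))
print(dump_filename(1423894754000, "report.pdf"))
```

Because the extension is taken from the original file, it says nothing about the CBOR encoding of the dumped content.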
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2) Another problem is that the root object has some garbled
>>>>>>>>> chars in each of the output JSON files (with the html extension
>>>>>>>>> at the end); PFA: garbled.jpg, and one of the output JSON files
>>>>>>>>> is also attached as an example (PFA: 1423894754000.html). The
>>>>>>>>> JSON files cannot be parsed properly by aggregate.py due to
>>>>>>>>> those garbled chars.
>>>>>>>>> Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>>>> element, which is what aggregate.py reads.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Text content and metadata extracted from the crawled binary
>>>>>>>>> data are stored in a structured document format (JSON).
>>>>>>>>> Furthermore, this document is encoded using CBOR
>>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable
>>>>>>>>> character that you notice in front of and at the end of the
>>>>>>>>> JSON data is due to CBOR encoding.
>>>>>>>>> Thus, if you need to read JSON data from a document dumped out
>>>>>>>>> by CommonCrawlDataDumper, you have to deserialize the
>>>>>>>>> CBOR-encoded data structure inside the file.
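
As an illustration of that deserialization step, the sketch below unwraps a file that holds a single definite-length CBOR text string and parses the JSON inside it. It is a hand-rolled decoder for this one case only (a real script would more likely use a CBOR library); the function name `unwrap_cbor_text` and the sample document are made up.

```python
import json
import struct


def unwrap_cbor_text(data: bytes) -> str:
    """Decode data holding one definite-length CBOR text string.

    Minimal sketch: handles an optional leading tag 55799 and the
    definite-length text-string forms only, not general CBOR.
    """
    if data.startswith(b"\xd9\xd9\xf7"):        # optional self-describing tag
        data = data[3:]
    major, info = data[0] >> 5, data[0] & 0x1F
    if major != 3:
        raise ValueError("not a CBOR text string")
    if info < 24:                                # length stored in the initial byte
        length, start = info, 1
    elif info <= 27:                             # 1-, 2-, 4- or 8-byte length follows
        size = 1 << (info - 24)
        fmt = {1: ">B", 2: ">H", 4: ">I", 8: ">Q"}[size]
        length, start = struct.unpack(fmt, data[1:1 + size])[0], 1 + size
    else:
        raise ValueError("indefinite-length strings are not handled here")
    return data[start:start + length].decode("utf-8")


# Build a tiny sample: JSON text wrapped in one text string with a
# 4-byte length (initial byte 0x7a).
payload = json.dumps({"url": "http://example.org/"}).encode("utf-8")
sample = b"\x7a" + struct.pack(">I", len(payload)) + payload
print(json.loads(unwrap_cbor_text(sample)))
```

The few bytes in front of the payload are the unreadable characters that show up when such a file is opened as plain JSON.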
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I hope this short overview can help in your work. I really
>>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>>> great job in detection.
>>>>>>>>> 
>>>>>>>>> I am available to provide all the support I can give, so do
>>>>>>>>> not hesitate to contact me if you need any further
>>>>>>>>> information.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Giuseppe
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Finally, after some research, I guess that the statistical
>>>>>>>>> information (present in the readme of the code repo) is not
>>>>>>>>> being collected and computed by aggregate.py from those output
>>>>>>>>> JSON files; it looks like it is coming from the log. See
>>>>>>>>> the following as an example:
>>>>>>>>> 
>>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>>> TOTAL Stats:
>>>>>>>>> [
>>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>>> ]
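For anyone who wants these numbers in a usable form, the following sketch reads a "TOTAL Stats" block of the shape shown above. Note that the block is not valid JSON as a whole (the objects are not separated by commas), so each line is parsed on its own. The function name parse_mime_stats and the shortened sample are illustrative assumptions.

```python
import json


def parse_mime_stats(log_text: str) -> dict:
    """Collect {mimeType: count} from a CommonCrawlDataDumper stats block."""
    counts = {}
    for line in log_text.splitlines():
        line = line.strip().rstrip(",")
        # Each entry is a one-line JSON object; counts are quoted strings.
        if line.startswith('{"mimeType"'):
            entry = json.loads(line)
            counts[entry["mimeType"]] = int(entry["count"])
    return counts


sample = """TOTAL Stats:
[
 {"mimeType":"application/pdf","count":"280"}
 {"mimeType":"text/html","count":"9506"}
 {"mimeType":"image/png","count":"110"}
]"""

stats = parse_mime_stats(sample)
total = sum(stats.values())
# Print the types from most to least frequent, with their share of the total.
for mime, n in sorted(stats.items(), key=lambda kv: -kv[1]):
    print(f"{mime}\t{n}\t{n / total:.1%}")
```

Running the same function over the log from two runs, one with and one without the probabilistic selection, would give two dictionaries that can be compared key by key.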
>>>>>>>>> 
>>>>>>>>> It turns out that aggregate.py is not the one that produces the
>>>>>>>>> statistical information; I am not sure what it does. Anyway, I
>>>>>>>>> think I understand the whole idea and I do concur with it.
>>>>>>>>> Maybe we can repackage Tika with the feature (i.e.
>>>>>>>>> probabilistic mime selection) incorporated and see if it
>>>>>>>>> outputs the same information in the log as the build without
>>>>>>>>> it.
>>>>>>>>> 
>>>>>>>>> BTW, regarding the use of the feature with probabilistic mime
>>>>>>>>> selection:
>>>>>>>>> in my pull request, I added a simple test case which might tell
>>>>>>>>> a bit more about how the feature is called and used; it is
>>>>>>>>> simple though.
>>>>>>>>> Here is an example snippet:
>>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>>     probSel.detect(input::InputStream, metadata::Metadata)
>>>>>>>>> It is similar to MimeTypes::detect(...) (more information on
>>>>>>>>> this can be found in
>>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>>> Now, in order to allow Tika().detect() to call
>>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to
>>>>>>>>> modify/add some code in TikaConfig, which initializes a list
>>>>>>>>> of default detectors: we need to remove the detector
>>>>>>>>> mimeTypes::MimeTypes from the list and replace it with
>>>>>>>>> probSel::ProbabilisticMimeDetectionSelector. (Not sure if I
>>>>>>>>> should create another pull request with this change for
>>>>>>>>> TikaConfig.)
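The swap described above can be pictured with a small language-neutral sketch, written here in Python. Everything in it is a stand-in: none of these classes or functions are Tika's API, the way the chain combines results is a simplification, and WeightedStandIn uses a plain weighted vote with made-up weights rather than the Bayesian rule discussed in TIKA-1517. It only shows the idea of replacing one detector in a list of default detectors without touching the caller.

```python
OCTET_STREAM = "application/octet-stream"


class MagicOnlyStandIn:
    """Stand-in for the default detector: looks only at the leading bytes."""

    def detect(self, data: bytes, metadata: dict) -> str:
        return "application/pdf" if data.startswith(b"%PDF-") else OCTET_STREAM


class WeightedStandIn:
    """Stand-in for the new selector: combines magic bytes and file name."""

    def detect(self, data: bytes, metadata: dict) -> str:
        scores = {}
        if data.startswith(b"%PDF-"):
            scores["application/pdf"] = scores.get("application/pdf", 0.0) + 0.6
        name = metadata.get("resourceName", "").lower()
        if name.endswith((".html", ".htm")):
            scores["text/html"] = scores.get("text/html", 0.0) + 0.4
        return max(scores, key=scores.get) if scores else OCTET_STREAM


def replace_detector(detectors, old_type, new_detector):
    """Return a copy of the list with every old_type instance swapped out."""
    return [new_detector if isinstance(d, old_type) else d for d in detectors]


def detect_with(detectors, data: bytes, metadata: dict) -> str:
    """Run the chain; the last detector with a specific answer wins (simplified)."""
    result = OCTET_STREAM
    for detector in detectors:
        guess = detector.detect(data, metadata)
        if guess != OCTET_STREAM:
            result = guess
    return result


defaults = [MagicOnlyStandIn()]
swapped = replace_detector(defaults, MagicOnlyStandIn, WeightedStandIn())
meta = {"resourceName": "page.html"}
print(detect_with(defaults, b"<html>", meta))
print(detect_with(swapped, b"<html>", meta))
```

Because the caller only ever sees the list, the front-end detect call keeps working unchanged after the swap, which is the point of making the change in the configuration rather than in each caller.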
>>>>>>>>> 
>>>>>>>>> I think that is all of my initial thoughts, with some findings
>>>>>>>>> and a plan; if there is anything you would like to add or
>>>>>>>>> comment on, please do kindly let me know, then I will start
>>>>>>>>> working on my 'finale'. BTW, don't worry: graduation is not
>>>>>>>>> the end of my work with Tika and this project. After I
>>>>>>>>> graduate I still can and want to help this polar project and
>>>>>>>>> Tika as much as possible: correct programming faults and
>>>>>>>>> bugs, respond to Tika issues, etc.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>>> [email protected]
>>>>>>>>> Subject: Re: this week action from luke
>>>>>>>>> Importance: High
>>>>>>>>> 
>>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>>> 
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>> 
>>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>>> 
>>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>>> integrated into Tika.
>>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>>> 
>>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>> 
>>>>>>>>> ------------------------
>>>>>>>>> Chris Mattmann
>>>>>>>>> [email protected]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Luke liu <[email protected]>
>>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>>>>>>>>> <[email protected]>
>>>>>>>>> Cc: 'Luke' <[email protected]>
>>>>>>>>> Subject: this week action from luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Professor Mattmann,
>>>>>>>>> 
>>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>>> will be fine.
>>>>>>>>> 
>>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>>> optimizing the code; the majority of the research has been
>>>>>>>>> done. It would also be great if you could comment on my work
>>>>>>>>> (the 2 pull requests) when you have time.
>>>>>>>>> 
>>>>>>>>> If you do have confusion with any of my work, please also do 
>>>>>>>>> let me know.
>>>>>>>>> 
>>>>>>>>> Thanks, and I am glad to be working with you. For the next
>>>>>>>>> couple of weeks before graduation, I am going to continue
>>>>>>>>> revising and testing the code and features to get rid of any
>>>>>>>>> flaws when I have time.
>>>>>>>>> 
>>>>>>>>> Not sure if I have missed something; if I have missed anything
>>>>>>>>> important, please do let me know too.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>> it, send an email to [email protected]
>>>>>>>>> <mailto:memex-jpl%[email protected]>.
>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>> <garbled.jpg><1423894754000.html>
>> 
>> 
>
