Great work, Luke; both of these changes make sense. Please send the pull request for that. Thank you!
Great work, Giuseppe! Go team!

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Thursday, April 23, 2015 at 3:08 AM
To: 'Luke' <hanson311...@gmail.com>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, Chris Mattmann <chris.mattm...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'" <tot...@di.uniroma1.it>, <dev@tika.apache.org>, "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, <memex-...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Both patches from Giuseppe work based on my tests: I could see the magic
>tag being prepended at the beginning of the file, and the cbor extension
>being appended when running the Nutch dump tool command with the
>"-extension cbor" option. Thanks a lot for the kind help, Giuseppe, highly
>appreciated. I want to give a big thumbs up to Giuseppe's work; it is
>thorough and considerate too.
>
>To professor,
>with Giuseppe's two patches, we still need to make a small change in Tika's
>mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in Tika,
>as it does not look very common; even if it accidentally appears in some
>other type of file, Tika will have the extension and metadata hint as a
>fallback strategy). I am going to send another pull request with that
>change, but before that, I would like to elaborate on what I am going to
>change to avoid any confusion.
>
>Now we have two problems.
>Problem 1: magic priority 40.
>   application/xhtml+xml has a higher magic priority (50) than
>application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So
>if xhtml gets read and compared first, cbor will not even be placed in the
>magic estimation list because it has a lower priority. Based on the tests,
>xhtml does get read and compared with the input file first, so any type
>with a priority below 50 will be disregarded.
>
>
>Problem 2: magic priority again, with 50.
>   In Tika, given a file dumped by the Nutch dumper tool, both types
>(xhtml and cbor) will be selected as candidate mime types and put in the
>magic estimation list; since the xhtml type gets read first, it is placed
>above cbor. To break that tie, Tika relies on the decision from the
>extension method. If the extension method fails to detect the type (for
>now, let's ignore the metadata hint method for simplicity, but the same
>applies to it too), then xhtml eventually gets returned.
>
>My pull request to be sent: I am going to set the magic priority of the
>cbor type to 50, the same as xhtml, because it would probably be risky to
>discard any one of the estimated types without consulting the extension
>method.
>
>Any comments, suggestions, or thoughts are welcome and appreciated.
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Luke [mailto:hanson311...@gmail.com]
>Sent: Wednesday, April 22, 2015 7:45 PM
>To: 'Mattmann, Chris A (3980)'
>Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)'; 'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 'memex-...@googlegroups.com'
>Subject: RE: [memex-jpl] this week action from luke
>
>Hi Prof,
>
>The test has finished, and the result is as expected.
>Both runs (Tika with the prob feature and Tika without it) produced the
>same "Stats total"; please see the attached matched.txt, dumped by a small
>program that compares, verbatim, each line in every section of the
>"Stats total" between the log produced by the Tika that has the feature
>and the one without it. If string.equals(...) is satisfied, the line is
>dumped out. If there is a mismatch (e.g. the count for a particular mime
>type is different), an error is dumped out. I don't see any error in the
>printout, so I think the feature has passed the test.
>
>
>The processing times of the two tests are as follows.
>Start and end time for the test where the Nutch dumper tool used the Tika
>with the prob selection feature:
>from
>2015-04-22 15:47:08,330
>to
>2015-04-22 17:48:28,877
>
>Start and end time for the test where the Nutch dumper tool used the Tika
>without the feature:
>from
>2015-04-22 22:41:23,459
>to
>2015-04-23 00:11:02,767
>
>
>BTW, I forgot to mention that the probabilistic mime selector with default
>weight settings also gives the following result, because by default I
>intentionally assign a higher weight to the magic bytes method so as to
>make it work in a way similar to the old strategy. On the other hand, if
>I know that the extension is more reliable, I can certainly add more weight
>to the extension approach; in that case, the prob mime selector will return
>application/cbor with a higher weight value.
>
>> <match value="<html xmlns=" type="string" offset="0:1024"/>
>> Result: "text/html"
>>
>> <match value="<html xmlns=" type="string" offset="0:6000"/>
>> Result: "application/xhtml+xml"
>
>
>Please kindly let me know if anything about the tests is unclear.
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Wednesday, April 22, 2015 3:49 PM
>To: Luke
>Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate); dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke, this is probably a good opportunity to test out your Bayesian
>mime detector. How does it perform here?
> >Sent from my iPhone > >> On Apr 22, 2015, at 3:29 PM, Luke <hanson311...@gmail.com> wrote: >> >> Hi professor, >> >> Please see the following results. >> <match value="<html xmlns=" type="string" offset="0:1024"/> >> Result: "text/html" >> >> <match value="<html xmlns=" type="string" offset="0:6000"/> >> Result: "application/xhtml+xml" >> >> >> Thanks >> Luke >> >> -----Original Message----- >> From: Chris Mattmann [mailto:chris.mattm...@gmail.com] >> Sent: Wednesday, April 22, 2015 4:21 AM >> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U >> (3980-Affiliate)'; dev@tika.apache.org >> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >> memex-...@googlegroups.com >> Subject: Re: [memex-jpl] this week action from luke >> >> Hi Luke, >> >> Actually I just meant go into tika-mimetypes.xml and change the magic >offsets for application/xhtml+xml and see if that works. The code you >changed below is actually how many bytes Tika will first download to do >MIME >checking. >> >> Cheers, >> Chris >> >> ------------------------ >> Chris Mattmann >> chris.mattm...@gmail.com >> >> >> >> >> -----Original Message----- >> From: Luke <hanson311...@gmail.com> >> Date: Wednesday, April 22, 2015 at 2:25 AM >> To: Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann ><chris.a.mattm...@jpl.nasa.gov>, "'Totaro, Giuseppe U (3980-Affiliate)'" >> <tot...@di.uniroma1.it>, <dev@tika.apache.org> >> Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, >> "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, >> NSF Polar CyberInfrastructure DR Students >> <nsf-polar-usc-stude...@googlegroups.com>, >> <memex-...@googlegroups.com> >> Subject: RE: [memex-jpl] this week action from luke >> >>> >>> Hi professor, >>> >>> I just tried it with minLength set to 1024, I get the following >>> "text/plain" >>> I am a bit surprised.... 
>>> >>> BTW, the 6000 min length still give "application/xhtml+xml"; with >>> anything below 1024 min length, I am seeing "text/plain". :) >>> >>> BTW, the min length I am referring/altering is as follows >>> MimeTypes.java >>> public int getMinLength() { >>> // This needs to be reasonably large to be able to correctly >>> detect >>> // things like XML root elements after initial comment and DTDs >>> return 64 * 1024; >>> } >>> >>> >>> Thanks >>> Luke >>> >>> -----Original Message----- >>> From: Chris Mattmann [mailto:chris.mattm...@gmail.com] >>> Sent: Tuesday, April 21, 2015 7:48 PM >>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U >>> (3980-Affiliate)'; dev@tika.apache.org >>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >>> memex-...@googlegroups.com >>> Subject: Re: [memex-jpl] this week action from luke >>> >>> Thanks Luke. >>> >>> So I guess all I was asking was could you try it out. Thanks for the >>> lesson in the RFC. >>> >>> Cheers, >>> Chris >>> >>> ------------------------ >>> Chris Mattmann >>> chris.mattm...@gmail.com >>> >>> >>> >>> >>> -----Original Message----- >>> From: Luke <hanson311...@gmail.com> >>> Date: Wednesday, April 22, 2015 at 1:46 AM >>> To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, Chris Mattmann >>> <chris.mattm...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'" >>> <tot...@di.uniroma1.it>, <dev@tika.apache.org> >>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, >>> "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, >>> NSF Polar CyberInfrastructure DR Students >>> <nsf-polar-usc-stude...@googlegroups.com>, >>> <memex-...@googlegroups.com> >>> Subject: RE: [memex-jpl] this week action from luke >>> >>>> Hi professor, >>>> >>>> >>>> I think it highly depends on the content being read by tika, e.g. 
if
>>>> there is a sequence of bytes in the file being read that matches the
>>>> magic of one or more of the mime types defined in our
>>>> tika-mimetypes.xml, I guess that Tika will put those types in its
>>>> estimation list; please note there could be multiple mime types
>>>> estimated by the magic-byte detection approach. Tika also considers
>>>> the decision made by the extension detection approach: if the
>>>> extension says the file type is the first one in the magic type
>>>> estimation list, then certainly the first one will be returned (the
>>>> same applies to the metadata hint approach). Of course, Tika also
>>>> prefers the type that is the most specialized.
>>>>
>>>> Let's get back to the following question; this is only my guess, though.
>>>> [Prof]: Also what happens if you tweak the definition of XHTML to
>>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>>> then?
>>>> Let's consider an extreme case where we only scan 10 bytes or 1 byte:
>>>> then it seems that the magic bytes method will inevitably detect
>>>> nothing, and I think it will return something like
>>>> "application/octet-stream", which is the most general type. As
>>>> mentioned, Tika favours the type that is the most specialized; if the
>>>> extension approach returns one that is more specialized (I believe
>>>> almost every type is a subclass of "application/octet-stream"), then
>>>> the answer in this extreme case may be yes: I think it is very
>>>> possible that the CBOR type detected by the extension approach takes
>>>> over in this case.
>>>>
>>>> My idea was and still is that if the cbor self-describing tag 55799
>>>> is present in the cbor file, then it can be used to detect the cbor
>>>> type. Again, the cbor type will probably be appended to the magic
>>>> estimation list together with another one such as application/html;
>>>> I guess the order in the list probably also matters: the first one
>>>> is preferred over the next one. The decision from the extension
>>>> detection approach also plays a role in breaking the tie,
>>>> e.g. if the extension detection method agrees on cbor with one of the
>>>> estimated types in the magic list, then cbor will be returned
>>>> (again, the same applies to the metadata hint method).
>>>>
>>>> I have not taken a closer look at a cbor file that has the tag
>>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the
>>>> tag should be present in the header as a fixed sequence of bytes
>>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this
>>>> is present in the file, or preferably in the header (within a
>>>> reasonable range of bytes), I believe it can probably be used as
>>>> the magic number for the cbor type.
>>>>
>>>>
>>>> There is another thing I mentioned in the jira ticket I opened
>>>> yesterday against the cbor parser and detection: it is also possible
>>>> that cbor content can be embedded inside a plain json file, and the
>>>> way that a decoder can distinguish them in that file is by looking at
>>>> the tag 55799 again. This may rarely happen, but a robust parser
>>>> might be able to take care of it; Tika might need to consider using
>>>> the FasterXML library used by the Nutch tool when developing the cbor
>>>> parser...
>>>> Again, let me cite the same paragraph from the RFC:
>>>>
>>>> " a decoder might be able to parse both CBOR and JSON.
>>>> Such a decoder would need to mechanically distinguish the two
>>>> formats. An easy way for an encoder to help the decoder would be to
>>>> tag the entire CBOR item with tag 55799, the serialization of which
>>>> will never be found at the beginning of a JSON text."
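The tag-based check discussed in this message can be sketched in plain Java. This is only an illustration, not Tika's detection code: the class and method names are hypothetical, and the only fact taken from RFC 7049 section 2.4.5 is that tag 55799 serializes to the three bytes 0xd9 0xd9 0xf7.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or Nutch.
class CborSelfDescribeCheck {
    // Serialization of CBOR tag 55799 (RFC 7049, section 2.4.5).
    private static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the data begins with the self-describe tag bytes. */
    public static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(data, SELF_DESCRIBE_TAG.length);
        return Arrays.equals(head, SELF_DESCRIBE_TAG);
    }

    public static void main(String[] args) {
        // Tag 55799 followed by the one-character text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithSelfDescribeTag(tagged)); // prints true
        System.out.println(startsWithSelfDescribeTag(json));   // prints false
    }
}
```

A check like this only looks at offset 0, which matches the RFC's point that the tag serialization never appears at the beginning of a JSON text; it says nothing about CBOR files written without the tag.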
>>>> >>>> >>>> Thanks >>>> Luke >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Mattmann, Chris A (3980) >>>> [mailto:chris.a.mattm...@jpl.nasa.gov] >>>> Sent: Tuesday, April 21, 2015 9:49 PM >>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate) >>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A >>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; >>>> memex-...@googlegroups.com >>>> Subject: Re: [memex-jpl] this week action from luke >>>> >>>> Hi Luke, >>>> >>>> Can you post the below conversation to dev@tika and summarize it >>>>there. >>>> Also what happens if you tweak the definition of XHTML to not scan >>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then? >>>> >>>> Cheers, >>>> Chris >>>> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Chris Mattmann, Ph.D. >>>> Chief Architect >>>> Instrument Software and Science Data Systems Section (398) NASA Jet >>>> Propulsion Laboratory Pasadena, CA 91109 USA >>>> Office: 168-519, Mailstop: 168-527 >>>> Email: chris.a.mattm...@nasa.gov >>>> WWW: http://sunset.usc.edu/~mattmann/ >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Adjunct Associate Professor, Computer Science Department University >>>> of Southern California, Los Angeles, CA 90089 USA >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> >>>> >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Luke <hanson311...@gmail.com> >>>> Date: Wednesday, April 22, 2015 at 12:19 AM >>>> To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U >>>> (3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann >>>> <chris.a.mattm...@jpl.nasa.gov> >>>> Cc: "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, >>>> "Zimdars, Paul A (3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, >>>> NSF Polar CyberInfrastructure DR Students >>>> <nsf-polar-usc-stude...@googlegroups.com>, >>>> "memex-...@googlegroups.com" 
<memex-...@googlegroups.com> >>>> Subject: RE: [memex-jpl] this week action from luke >>>> >>>>> Hi Professor, >>>>> Please see attached jpg for the difference. >>>>> Thanks >>>>> Luke >>>>> >>>>> -----Original Message----- >>>>> From: Chris Mattmann [mailto:chris.mattm...@gmail.com] >>>>> Sent: Tuesday, April 21, 2015 5:27 PM >>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)' >>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >>>>> memex-...@googlegroups.com >>>>> Subject: Re: [memex-jpl] this week action from luke >>>>> >>>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m >>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app >>>>> -m < /path/to/cbor/file.cbor any difference? >>>>> >>>>> ------------------------ >>>>> Chris Mattmann >>>>> chris.mattm...@gmail.com >>>>> >>>>> >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: Luke <hanson311...@gmail.com> >>>>> Date: Tuesday, April 21, 2015 at 5:41 PM >>>>> To: 'Luke' <hanson311...@gmail.com>, Chris Mattmann >>>>> <chris.mattm...@gmail.com>, 'Giuseppe Totaro' >>>>> <tot...@di.uniroma1.it>, Chris Mattmann >>>>> <chris.a.mattm...@jpl.nasa.gov> >>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, >>>>> "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, >>>>> NSF Polar CyberInfrastructure DR Students >>>>> <nsf-polar-usc-stude...@googlegroups.com>, >>>>> <memex-...@googlegroups.com> >>>>> Subject: RE: [memex-jpl] this week action from luke >>>>> >>>>>> Hi professor, >>>>>> I just sent a pull request for adding cbor extension. >>>>>> The interesting thing is that tika is still identifying the file >>>>>> dumped by the nutch dump tool as a "application/xhtml+xml" even >>>>>> when I manually change the file extension to the correct one (i.e. >*.cbor ). 
>>>>>> >>>>>> The reason is probably that tika is identifying >"application/xhtml+xml" >>>>>> by searching for the "<html" in the file content, PFA: >>>>>> xhtml+xml.jpg; Now if you take a look at the cbor file dumped by >>>>>> xhtml+nutch, >>>>>> you see that we do have that element as part of the cbor content >>>>>> because the entire crawled xhtml document seems to be imbedded in >>>>>> the cbor json(PFA: >>>>>> cbor.jpg); and also in Tika, the magic detection seems to have >>>>>> higher priority over the glob detection, thus the type is being >>>>>> incorrectly detected. >>>>>> >>>>>> Therefore, I would like to please mention that adding the entry of >>>>>> <glob pattern="*.cbor"/> is not resolving the issue as of now >>>>>> without some fixed magic bytes / patterns for cbor. >>>>>> I also would like to add that the thing will be different with our >>>>>> probabilistic mime detection selector, because if we know that the >>>>>> file extension is more reliable than magic bytes, then we can >>>>>> certainly add more preferential weight to the extension... this >>>>>> also might show the current implementation with MimeTypes >>>>>> detection is a bit stiff or less flexible in this scneario. :) >>>>>> >>>>>> >>>>>> Thanks >>>>>> Luke >>>>>> >>>>>> -----Original Message----- >>>>>> From: Luke [mailto:hanson311...@gmail.com] >>>>>> Sent: Tuesday, April 21, 2015 12:14 PM >>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)' >>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >>>>>> 'memex-...@googlegroups.com' >>>>>> Subject: RE: [memex-jpl] this week action from luke >>>>>> >>>>>> Yes, let me add the cbor extension entry in tika xml, will send >>>>>> the pull request soon. 
>>>>>> >>>>>> Thanks >>>>>> Luke >>>>>> -----Original Message----- >>>>>> From: Chris Mattmann [mailto:chris.mattm...@gmail.com] >>>>>> Sent: Tuesday, April 21, 2015 6:51 AM >>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980) >>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A >>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; >>>>>> memex-...@googlegroups.com >>>>>> Subject: Re: [memex-jpl] this week action from luke >>>>>> >>>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER >>>>>> and tag along with adding an -extension command would be fantastic. >>>>>> Can you file both of those NUTCH issues, wait a day or so, and >>>>>> then based on feedback use your new Nutch commit karma to get >>>>>> those into Nutch? >>>>>> >>>>>> And then when creating the issues, can you link to the TIKA-1610 >issue? >>>>>> At that point, when those two to be defined NUTCH issues are up, >>>>>> Luke, in parallel can you throw up a pull request/patch in Tika >>>>>> for the extension along with the MIME detection? >>>>>> >>>>>> Cheers, >>>>>> Chris >>>>>> >>>>>> ------------------------ >>>>>> Chris Mattmann >>>>>> chris.mattm...@gmail.com >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: Giuseppe Totaro <tot...@di.uniroma1.it> >>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM >>>>>> To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov> >>>>>> Cc: Luke <hanson311...@gmail.com>, Chris Mattmann >>>>>> <chris.mattm...@gmail.com>, "Bryant, Ann C (398G-Affiliate)" >>>>>> <anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)" >>>>>> <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR >>>>>> Students <nsf-polar-usc-stude...@googlegroups.com>, >>>>>> "memex-...@googlegroups.com" >>>>>> <memex-...@googlegroups.com> >>>>>> Subject: Re: [memex-jpl] this week action from luke >>>>>> >>>>>>> Thanks Luke. Great work. 
>>>>>>> Chris, we wrap a single string value, representing the JSON text,
>>>>>>> for each file into CBOR (by using the serializeCBORData method). For
>>>>>>> instance, using the Unix hex dump tool, we can see that, as
>>>>>>> expected, the first byte of all files is "0x7A" (the first three
>>>>>>> bits are "011", that is the major type for text strings, and the
>>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the
>>>>>>> length of the following text), and the following 4 bytes
>>>>>>> (a 32-bit unsigned integer) encode the right length of the file
>>>>>>> (as described in RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>>>> Therefore, a CBOR tag is currently not included in the file (a list
>>>>>>> of cbor tags is available here
>>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>>> I did not know about the CBOR "magic header". Thanks a lot Luke for
>>>>>>> this great research. Chris, if you agree, I can add support for
>>>>>>> prepending the self-describing CBOR tag 55799 to the
>>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I
>>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the
>>>>>>> CBORGenerator class (the source code is available here
>>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>>> Then, I can comment on the TIKA-1610
>>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>>>
>>>>>>> Regarding the file extension, in the Memex CCA format the
>>>>>>> original file extension is used. We could add support for a
>>>>>>> -extension command-line option allowing the user to give a file
>>>>>>> extension (e.g., cbor) for all files dumped out.
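The initial-byte layout described in this message (three major-type bits followed by five additional-information bits) can be checked with a few lines of Java. This is a minimal sketch with hypothetical class and method names; it assumes only the RFC 7049 layout. The bit pattern "011" + "11010" corresponds to the byte value 0x7A, where additional information 26 means the length follows as a 32-bit unsigned integer, whereas 0x7F would carry additional information 31, which RFC 7049 uses for an indefinite-length text string.

```java
// Hypothetical helper for illustration; not part of Nutch, Tika, or Jackson.
class CborInitialByte {
    /** Major type: the high-order 3 bits of the initial byte. */
    public static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low-order 5 bits of the initial byte. */
    public static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    public static void main(String[] args) {
        int[] samples = {0x7A, 0x7F};
        for (int b : samples) {
            System.out.printf("0x%02X -> major type %d, additional info %d%n",
                    b, majorType(b), additionalInfo(b));
        }
        // prints:
        // 0x7A -> major type 3, additional info 26
        // 0x7F -> major type 3, additional info 31
    }
}
```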
>>>>>>> >>>>>>> Thanks a lot, >>>>>>> Giuseppe >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) >>>>>>> <chris.a.mattm...@jpl.nasa.gov> wrote: >>>>>>> >>>>>>> Thanks for this great research, Luke! >>>>>>> >>>>>>> Giuseppe, any idea why this tag doesn't make it into the file? >>>>>>> >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>>> Chris Mattmann, Ph.D. >>>>>>> Chief Architect >>>>>>> Instrument Software and Science Data Systems Section (398) NASA >>>>>>> Jet Propulsion Laboratory Pasadena, CA 91109 USA >>>>>>> Office: 168-519, Mailstop: 168-527 >>>>>>> Email: chris.a.mattm...@nasa.gov >>>>>>> WWW: http://sunset.usc.edu/~mattmann/ >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>>> Adjunct Associate Professor, Computer Science Department >>>>>>> University of Southern California, Los Angeles, CA 90089 USA >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Luke <hanson311...@gmail.com> >>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM >>>>>>> To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe >>>>>>> U (3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann >>>>>>> <chris.a.mattm...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)" >>>>>>> <anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)" >>>>>>> <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR >>>>>>> Students <nsf-polar-usc-stude...@googlegroups.com>, >>>>>>> "memex-...@googlegroups.com" >>>>>>> <memex-...@googlegroups.com> >>>>>>> Subject: RE: [memex-jpl] this week action from luke >>>>>>> >>>>>>>> Thanks professor. >>>>>>>> Hi professor and all. >>>>>>>> JIRA issue : CBOR Parser and detection improvement >>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610 >>>>>>>> >>>>>>>> I tried to conduct a bit research with this cbor detection. 
>>>>>>>> >>>>>>>> It looks like there is a self describing tag that needs to be >>>>>>>> written in the cbor file thru which other applications might be >>>>>>>> able to identify the cbor type.... >>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5 >>>>>>>> >>>>>>>> I don't see that tag being present in the cbor file dumped by >>>>>>>> the nutch tool, I am not very sure though. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Luke >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Chris Mattmann [mailto:chris.mattm...@gmail.com] >>>>>>>> Sent: Monday, April 20, 2015 4:10 AM >>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C >>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF >>>>>>>> Polar CyberInfrastructure DR Students'; >>>>>>>> memex-...@googlegroups.com >>>>>>>> Subject: Re: [memex-jpl] this week action from luke >>>>>>>> >>>>>>>> Nice one, Luke. If you have a second and you can open up an >>>>>>>> issue in Tika to make it support CBOR, then yes, by all means! 
>>>>>>>> :) >>>>>>>> >>>>>>>> >>>>>>>> ------------------------ >>>>>>>> Chris Mattmann >>>>>>>> chris.mattm...@gmail.com >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Luke <hanson311...@gmail.com> >>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM >>>>>>>> To: 'Giuseppe Totaro' <tot...@di.uniroma1.it>, Chris Mattmann >>>>>>>> <chris.mattm...@gmail.com>, Chris Mattmann >>>>>>>> <chris.a.mattm...@jpl.nasa.gov>, "'Bryant, Ann C >>>>>>>>(398G-Affiliate)'" >>>>>>>> <anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" >>>>>>>> <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR >>>>>>>> Students <nsf-polar-usc-stude...@googlegroups.com>, >>>>>>>> <memex-...@googlegroups.com> >>>>>>>> Subject: RE: [memex-jpl] this week action from luke >>>>>>>> >>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit >>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , >>>>>>>>>appreciated. >>>>>>>>> >>>>>>>>> BTW, it looks like Tika might need to consider the support with >>>>>>>>> COBR parser and detection. >>>>>>>>> I checked the rfc, it looks like CBOR has not got magic numbers. >>>>>>>>> PFA: >>>>>>>>> rfc_cbor.jpg >>>>>>>>> Actually, I don't quite understand why the >>>>>>>>> CommonCrawlDataDumper is not dumping the nutch segments with >>>>>>>>> the .cbor extension, which seems to be helpful for type >>>>>>>>>detection. >>>>>>>>> >>>>>>>>> To professor Mattmann, >>>>>>>>> Tika does not support the detection of COBR, although the trunk >>>>>>>>> version has the entries (PFA: cbor_tika.mimetypes.xml)for cbor >>>>>>>>> in the tika-mimetypes.xml, those entries are not detecting >>>>>>>>> properly the cobr files dumped by CommonCrawlDataDumper. Also >>>>>>>>> CBOR does not have magic bytes, off the top of my head the only >>>>>>>>> way we can detect it is using the extension, and content byte >>>>>>>>> histogram (please note, this is a local optimal solution and >>>>>>>>> data-dependent.) 
:)
>>>>>>>>>
>>>>>>>>> I think I am deviating a bit from the main discussion of this
>>>>>>>>> thread, i.e. the plan for testing the "probabilistic mime
>>>>>>>>> detector selection" with polar data.
>>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic
>>>>>>>>> selection feature incorporated, replace the Tika jar in Nutch
>>>>>>>>> with the repackaged one, and then run CommonCrawlDataDumper
>>>>>>>>> and see how it goes. If you have any specific ideas or
>>>>>>>>> thoughts about the testing, please kindly let me know.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>>
>>>>>>>>> From: Giuseppe Totaro [mailto:tot...@di.uniroma1.it]
>>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>>> To: Luke liu
>>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Luke,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> my name is Giuseppe and I am a PhD student working under the
>>>>>>>>> supervision of Prof. Chris Mattmann. I worked on the
>>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>>>> couple of your observations. My comments are inline below.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 19 Apr 2015, at 12:11, Luke liu <shuai...@usc.edu> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot professor; sorry for the brief delay, I was
>>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>>
>>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is
>>>>>>>>> dumping the crawl segments to json files with human-readable
>>>>>>>>> and understandable content.
>>>>>>>>> 1) I am trying to run one of the commands on my side as shown >>>>>>>>> in gen-common-crawl.sh, but the generated files all end with >>>>>>>>> .html or htm; The command listed in gen-common-crawl.sh seems >>>>>>>>> to be allude to where the data is located on our >>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org> >>>>>>>>> <http://nsfpolardata.dyndns.org/>; although the locations are >>>>>>>>> not exactly correct (probably they need to be updated), part of >>>>>>>>> the patterns was able to allow me to locate some similar datasets >(e.g. >>>>>>>>> /data2/crawls/raw/CS572Spring2015 ) again I am seeing the >>>>>>>>> dumped files are all ending with html, but surprisingly inside >>>>>>>>> those outputted html files, the contents are present in json >>>>>>>>> format; >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The file extension is (almost) always the same as the original >file. >>>>>>>>> More in detail, using the -epochFilename command-line option >>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored >>>>>>>>> with a filename of the format >>>>>>>>> <epochtime(milliseconds)>.<filetype>, >>>>>>>>> where <filetype> is either the extension of the original file >>>>>>>>> or .html as default if the original file does not have an >extension. >>>>>>>>> This schema is used for file naming and it does not depend on >>>>>>>>> internal output format (JSON). >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> 2) Another problem is that the root object is being set with >>>>>>>>> some garbled chars in each of the outputted json files (with >>>>>>>>> extension html in the end), PFA: garbled.jpg and one of the >>>>>>>>> outputted json file has been also attached as an example too >>>>>>>>>(PFA: >>>>>>>>> 1423894754000.html); the json files cannot be parsed properly >>>>>>>>> by aggregate.py due to those garbled chars. 
>>>>>>>>> Even if I get rid of those garbled chars, there are not >>>>>>>>> mimeTypes element which are being read by aggregate.py. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Text content and metadata extracted from the crawled binary >>>>>>>>> data are stored in a structured document format (JSON). >>>>>>>>> Furthermore, this document is encoded using CBOR >>>>>>>>> <http://cbor.io/> serialization. Each not human-readable >>>>>>>>> character that you notice in front and at the end of JSON data is >due to CBOR-encoding. >>>>>>>>> Thus, if you need to read JSON data from document dumped out by >>>>>>>>> CommonCrawlDataDumper, you have to deserialized the >>>>>>>>> CBOR-encoded data structure inside the file. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I hope this short overview can help in you work. I really >>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your >>>>>>>>> great job in detection. >>>>>>>>> >>>>>>>>> I am available to provide you all support I can give, so you do >>>>>>>>> not hesitate to contact me if you may need any further >>>>>>>>>information. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Giuseppe >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Finally, after some research, I guess that the statistical >>>>>>>>> information (present in the readme of the code repo) is not >>>>>>>>> being collected and computed by aggregate.py from those output >>>>>>>>> json files but it looks like it is coming from the log.... 
>>>>>>>>> see the following as an example:
>>>>>>>>>
>>>>>>>>> 2015-04-19 04:55:42,078 INFO tools.CommonCrawlDataDumper -
>>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>>> TOTAL Stats:
>>>>>>>>> [
>>>>>>>>> {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>> {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>> {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>> {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>> {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>> {"mimeType":"application/zip","count":"6"}
>>>>>>>>> {"mimeType":"application/xml","count":"11"}
>>>>>>>>> {"mimeType":"image/png","count":"110"}
>>>>>>>>> {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>> {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>> {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>> {"mimeType":"video/mp4","count":"3"}
>>>>>>>>> {"mimeType":"text/plain","count":"104"}
>>>>>>>>> {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>> {"mimeType":"image/gif","count":"2"}
>>>>>>>>> {"mimeType":"text/x-php","count":"1"}
>>>>>>>>> {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>> {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>> {"mimeType":"text/html","count":"9506"}
>>>>>>>>> {"mimeType":"application/pdf","count":"280"}
>>>>>>>>> ]
>>>>>>>>>
>>>>>>>>> It turns out that aggregate.py is not what produces the
>>>>>>>>> statistical information, and I am not sure what it does.
>>>>>>>>> Anyway, I think I understand the whole idea and I concur with
>>>>>>>>> it: maybe we can repackage Tika with the feature (i.e.
>>>>>>>>> probabilistic mime selection) incorporated and see whether it
>>>>>>>>> outputs the same information in the log as the build without
>>>>>>>>> it.
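For comparing two such runs (with and without probabilistic mime selection), the per-type counts in the "TOTAL Stats" excerpt above could be pulled out of the log with a few lines of Python. This is only a minimal sketch: the function name parse_mime_stats is made up for illustration, and it assumes each {"mimeType": ..., "count": ...} entry sits on its own line as in the excerpt, possibly behind mail quote markers. It is not how aggregate.py or CommonCrawlDataDumper works.

```python
import json


def parse_mime_stats(log_text):
    """Return {mime type: count} from lines like {"mimeType":"x","count":"3"}.

    Hypothetical helper: assumes one JSON object per log line.
    """
    counts = {}
    for line in log_text.splitlines():
        # Drop surrounding whitespace, mail quote markers and trailing commas.
        line = line.strip().lstrip(">").strip().rstrip(",")
        if not (line.startswith("{") and line.endswith("}")):
            continue  # timestamps, "TOTAL Stats:", "[" and "]" are skipped
        try:
            entry = json.loads(line)
            mime = entry["mimeType"]
            count = int(entry["count"])  # counts are logged as strings
        except (ValueError, TypeError, KeyError):
            continue  # not a stats entry after all
        counts[mime] = counts.get(mime, 0) + count
    return counts


if __name__ == "__main__":
    sample = """TOTAL Stats:
[
{"mimeType":"image/png","count":"110"}
{"mimeType":"text/html","count":"9506"}
]"""
    stats = parse_mime_stats(sample)
    total = sum(stats.values())
    # Print types from most to least frequent with their share of the total.
    for mime, n in sorted(stats.items(), key=lambda kv: -kv[1]):
        print(f"{mime}\t{n}\t{n / total:.1%}")
```

Running the same function over the logs of both runs and diffing the two dictionaries would show which types moved, for example from application/octet-stream to something more specific.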
>>>>>>>>>
>>>>>>>>> BTW, regarding the use of the probabilistic mime selection
>>>>>>>>> feature: in my pull request I added a test case that shows a
>>>>>>>>> bit more about how the feature is called and used, although
>>>>>>>>> it is a simple one.
>>>>>>>>> Here is an example snippet:
>>>>>>>>>
>>>>>>>>> ProbabilisticMimeDetectionSelector probSel = new
>>>>>>>>> ProbabilisticMimeDetectionSelector();
>>>>>>>>> probSel.detect(input, metadata); // input: InputStream, metadata: Metadata
>>>>>>>>>
>>>>>>>>> It is similar to MimeTypes::detect(...) (more information on
>>>>>>>>> this can be found in
>>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>>> Now, in order for Tika().detect() to call
>>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (since
>>>>>>>>> Tika().detect() is what commoncrawldump calls), we need to
>>>>>>>>> modify/add some code in TikaConfig, which initializes a list
>>>>>>>>> of default detectors: we need to remove the detector
>>>>>>>>> mimeTypes::MimeTypes from that list and replace it with
>>>>>>>>> probSel::ProbabilisticMimeDetectionSelector. (I am not sure
>>>>>>>>> whether I should create another pull request with this change
>>>>>>>>> for TikaConfig.)
>>>>>>>>>
>>>>>>>>> I think that is all of my initial thoughts, findings and
>>>>>>>>> plan. If you have anything you would like to add or comment
>>>>>>>>> on, please let me know, and then I will start working on my
>>>>>>>>> 'finale'. BTW, don't worry: graduation is not the end of my
>>>>>>>>> involvement with Tika and this project. After I graduate I
>>>>>>>>> still can and want to help this polar project and Tika as
>>>>>>>>> much as possible, correct programming faults and bugs,
>>>>>>>>> respond to Tika issues, etc.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C
>>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students;
>>>>>>>>> memex-...@googlegroups.com
>>>>>>>>> Subject: Re: this week action from luke
>>>>>>>>> Importance: High
>>>>>>>>>
>>>>>>>>> Awesome Luke. I am now going to work specifically on
>>>>>>>>> benchmarking your code in real situations. For example, it
>>>>>>>>> would be fantastic to now run your Bayesian MIME detector over
>>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>>>
>>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>>>
>>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data,
>>>>>>>>> and Annie, also CC'ed, can explain it.
>>>>>>>>>
>>>>>>>>> Can we make it your goal for the next 2 weeks to actually
>>>>>>>>> test it and produce a real result over the whole TREC-DD data
>>>>>>>>> for Polar? My goal will be to get your code committed and
>>>>>>>>> integrated into Tika.
>>>>>>>>> The more of a guide you can write me on how to build and test
>>>>>>>>> your code with Tika, the easier it will be for me to get it
>>>>>>>>> committed.
>>>>>>>>>
>>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is
>>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's
>>>>>>>>> existing MIME detection approach. If folks have any Memex needs
>>>>>>>>> to try and test more accurate file identification with Tika,
>>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> ------------------------
>>>>>>>>> Chris Mattmann
>>>>>>>>> chris.mattm...@gmail.com
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Luke liu <shuai...@usc.edu>
>>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>>> To: Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann
>>>>>>>>> <chris.a.mattm...@jpl.nasa.gov>
>>>>>>>>> Cc: 'Luke' <hanson311...@gmail.com>
>>>>>>>>> Subject: this week action from luke
>>>>>>>>>
>>>>>>>>> Hi Professor Mattmann,
>>>>>>>>>
>>>>>>>>> I think I am in the final phase of the research. Last week I
>>>>>>>>> finished the last item in the list, and hopefully everything
>>>>>>>>> will be fine.
>>>>>>>>>
>>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>>> optimizing the code, since the majority of the research has
>>>>>>>>> been done. It would also be great if you could comment on my
>>>>>>>>> work (the 2 pull requests) when you have time.
>>>>>>>>>
>>>>>>>>> If anything in my work is confusing, please let me know.
>>>>>>>>>
>>>>>>>>> Thanks, and I am glad to be working with you. For the next
>>>>>>>>> couple of weeks before graduation, I am going to continue
>>>>>>>>> revising and testing the code and features to get rid of
>>>>>>>>> flaws (if any) when I have time.
>>>>>>>>>
>>>>>>>>> I am not sure whether I have missed something; if I have
>>>>>>>>> missed anything important, please let me know too.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Luke
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>> it, send an email to memex-jpl+unsubscr...@googlegroups.com
>>>>>>>>> <mailto:memex-jpl%2bunsubscr...@googlegroups.com>.
>>>>>>>>> To post to this group, send email to memex-...@googlegroups.com.
>>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>> <garbled.jpg><1423894754000.html>
>>
>>
>