Hi Prof,
I am working on that. Running the whole gen-common-crawl.sh script takes
a while (around 2 or 3 hours).
A couple of suspicious errors also made me rerun the script several
times, so I need to be careful when testing with data of that size.

I will keep you updated on the findings and progress.

Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]] 
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
[email protected]; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
[email protected]
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke. This is probably a good opportunity to test out your Bayesian
MIME detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <[email protected]> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
> 
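One possible reading of the two results above is that the string "<html xmlns=" first appears somewhere past offset 1024 but before offset 6000 in the dumped file. The following is a minimal sketch of an offset-window magic check, not Tika's actual matcher; the function name `magic_matches`, the inclusive treatment of the window bounds, and the sample data are assumptions made for illustration.

```python
def magic_matches(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """Return True if `pattern` begins at some offset in [start, end] of `data`.

    Illustrative model of an offset="start:end" magic rule; the inclusive
    bounds are an assumption, not taken from the Tika source.
    """
    # Only look at the slice in which a match could start inside the window.
    window = data[start:end + len(pattern)]
    return window.find(pattern) != -1


# Hypothetical sample: the pattern starts at offset 3000.
sample = b" " * 3000 + b"<html xmlns=" + b"..."

print(magic_matches(sample, b"<html xmlns=", 0, 1024))  # window too short
print(magic_matches(sample, b"<html xmlns=", 0, 6000))  # window covers offset 3000
```

With this model, narrowing the window from 0:6000 to 0:1024 stops the XHTML rule from firing for such a file, so a more general type can win.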
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:[email protected]]
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
> (3980-Affiliate)'; [email protected]
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
> [email protected]
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic
> offsets for application/xhtml+xml and see if that works. The code you
> changed below is actually how many bytes Tika will first download to do
> MIME checking.
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> [email protected]
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <[email protected]>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <[email protected]>, Chris Mattmann
<[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <[email protected]>, <[email protected]>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
> NSF Polar CyberInfrastructure DR Students 
> <[email protected]>,
> <[email protected]>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get the following:
>> "text/plain"
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> any min length below 1024, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to and altering is the following, in 
>> MimeTypes.java:
>>    public int getMinLength() {
>>       // This needs to be reasonably large to be able to correctly
>>       // detect things like XML root elements after initial comment
>>       // and DTDs
>>       return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[email protected]]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; [email protected]
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> [email protected]
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> [email protected]
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <[email protected]>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <[email protected]>, <[email protected]>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>> NSF Polar CyberInfrastructure DR Students 
>> <[email protected]>,
>> <[email protected]>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it highly depends on the content being read by Tika. If a 
>>> sequence of bytes in the file matches the magic of one or more MIME 
>>> types defined in our tika-mimetypes.xml, I guess that Tika will put 
>>> those types in its estimation list; note that the magic-byte 
>>> detection approach can produce multiple estimated MIME types. Tika 
>>> also considers the decision made by the extension detection 
>>> approach: if the type suggested by the extension is the first one in 
>>> the magic estimation list, then that first one will certainly be 
>>> returned (the same applies to the metadata hint approach). Of 
>>> course, Tika also prefers the type that is the most specialized.
>>> 
>>> Let's get back to the following question; this is only my guess.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to 
>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take 
>>> over then?
>>> Let's consider an extreme case where we only scan 10 bytes or 1 byte. 
>>> Then magic bytes will inevitably detect nothing, and I think Tika 
>>> will return something like "application/octet-stream", which is the 
>>> most general type. As mentioned, Tika favours the most specialized 
>>> type, and I believe almost every type is a subclass of 
>>> "application/octet-stream", so any type returned by the extension 
>>> approach is more specialized. Therefore the answer in this extreme 
>>> case may be yes: I think it is very possible that the CBOR type 
>>> detected by the extension approach takes over.
>>> 
>>> My idea was and still is that if the CBOR self-describing tag 55799 
>>> is present in the CBOR file, then it can be used to detect the CBOR 
>>> type.
>>> Again, the CBOR type will probably be appended to the magic 
>>> estimation list together with another one such as 
>>> application/xhtml+xml. I guess the order in the list also matters: 
>>> the first one is preferred over the next one. The decision from the 
>>> extension detection approach also plays a role in breaking the tie, 
>>> e.g. if the extension detection method agrees on CBOR with one of the 
>>> estimated types in the magic list, then CBOR will be returned 
>>> (again, the same applies to the metadata hint method).
>>> 
>>> I have not taken a closer look at a CBOR file that has the tag 
>>> 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the 
>>> tag should be present in the header as a fixed sequence of bytes 
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is 
>>> present in the file, preferably in the header (within a reasonable 
>>> range of bytes), I believe it can probably be used as the magic 
>>> number for the CBOR type.
>>> 
>>> 
>>> There is another thing I mentioned in the JIRA ticket I opened 
>>> yesterday against the CBOR parser and detection: it is also possible 
>>> for CBOR content to be embedded inside a plain JSON file, and the way 
>>> a decoder can distinguish them in that file is again by looking at 
>>> the tag 55799. This may rarely happen, but a robust parser should be 
>>> able to take care of it. Tika might need to consider using the 
>>> FasterXML library used by the Nutch tool when developing the CBOR 
>>> parser...
>>> Again, let me cite the same paragraph from the RFC:
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two  
>>> formats.  An easy way for an encoder to help the decoder would be to  
>>> tag the entire CBOR item with tag 55799, the serialization of which  
>>> will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
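The magic-number idea above can be sketched as a simple prefix check. This assumes the encoder actually wrote the self-describing tag 55799 at the very start of the file, which RFC 7049 section 2.4.5 serializes as the three bytes 0xd9 0xd9 0xf7; the function name `has_self_describe_tag` is hypothetical and this is not Tika code.

```python
# Tag 55799 as serialized on the wire (RFC 7049, section 2.4.5):
# 0xd9 = major type 6 (tag) with a 2-byte tag number following,
# 0xd9f7 = 55799.
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])


def has_self_describe_tag(data: bytes) -> bool:
    """Return True if `data` starts with the self-describing CBOR tag."""
    return data.startswith(SELF_DESCRIBE_CBOR)


# A tagged one-character text string versus plain JSON text.
print(has_self_describe_tag(bytes.fromhex("d9d9f7") + b"\x61a"))
print(has_self_describe_tag(b'{"a": 1}'))
```

A file without the tag, such as the ones discussed in this thread, would fail this check, so the extension or another hint would still be needed for those.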
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) 
>>> [mailto:[email protected]]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; 
>>> [email protected]
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University 
>>> of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <[email protected]>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>> <[email protected]>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <[email protected]>, 
>>> NSF Polar CyberInfrastructure DR Students 
>>> <[email protected]>,
>>> "[email protected]" <[email protected]>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:[email protected]]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> [email protected]
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app 
>>>> -m < /path/to/cbor/file.cbor any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <[email protected]>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <[email protected]>, Chris Mattmann 
>>>> <[email protected]>, 'Giuseppe Totaro'
>>>> <[email protected]>, Chris Mattmann 
>>>> <[email protected]>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <[email protected]>,
>>>> <[email protected]>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that Tika is still identifying the file 
>>>>> dumped by the Nutch dump tool as "application/xhtml+xml" even 
>>>>> when I manually change the file extension to the correct one 
>>>>> (i.e. *.cbor).
>>>>> 
>>>>> The reason is probably that Tika identifies 
>>>>> "application/xhtml+xml" by searching for "&lt;html" in the file 
>>>>> content (PFA: xhtml+xml.jpg). Now if you take a look at the CBOR 
>>>>> file dumped by Nutch, you see that we do have that element as part 
>>>>> of the CBOR content, because the entire crawled XHTML document 
>>>>> seems to be embedded in the CBOR JSON (PFA: cbor.jpg). Also, in 
>>>>> Tika the magic detection seems to have higher priority than the 
>>>>> glob detection, thus the type is being incorrectly detected.
>>>>> 
>>>>> Therefore, I would like to mention that adding the entry 
>>>>> <glob pattern="*.cbor"/> does not resolve the issue as of now 
>>>>> without some fixed magic bytes / patterns for CBOR.
>>>>> I would also like to add that things will be different with our 
>>>>> probabilistic MIME detection selector: if we know that the file 
>>>>> extension is more reliable than magic bytes, then we can certainly 
>>>>> give more preferential weight to the extension... This might also 
>>>>> show that the current MimeTypes detection implementation is a bit 
>>>>> stiff or inflexible in this scenario. :)
>>>>> 
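To illustrate the weighting idea only, here is a toy weighted vote between detection sources. It is not the logic of ProbabilisticMimeDetectionSelector or of MimeTypes; the function `pick_type`, the source names, and the weight values are all made up for this example.

```python
def pick_type(candidates: dict, weights: dict) -> str:
    """Return the MIME type with the highest summed weight.

    candidates maps a detection source (e.g. "magic") to the type it
    proposes; weights maps a source to how much we trust it.
    Toy model only: not how Tika combines its detectors.
    """
    scores = {}
    for source, mime in candidates.items():
        scores[mime] = scores.get(mime, 0.0) + weights.get(source, 0.0)
    return max(scores, key=scores.get)


# Hypothetical situation from this thread: magic sees "<html" inside the
# file, while the extension says CBOR.
candidates = {"magic": "application/xhtml+xml", "extension": "application/cbor"}

print(pick_type(candidates, {"magic": 0.6, "extension": 0.3}))
print(pick_type(candidates, {"magic": 0.3, "extension": 0.6}))
```

The point is simply that shifting trust toward the extension flips the outcome, which a fixed magic-over-glob priority cannot do.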
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> '[email protected]'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send 
>>>>> the pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> [email protected]
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>>> and tag along with adding an -extension command would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and 
>>>>> then based on feedback use your new Nutch commit karma to get 
>>>>> those into Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610 
>>>>> issue?
>>>>> At that point, when those two to be defined NUTCH issues are up, 
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika 
>>>>> for the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <[email protected]>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <[email protected]>
>>>>> Cc: Luke <[email protected]>, Chris Mattmann 
>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>> Students <[email protected]>,
>>>>> "[email protected]"
>>>>> <[email protected]>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the 
>>>>>> length of following text), and the following 4 bytes 
>>>>>> (single-precision
>>>>>> float) encodes the right length of file (as described in RFC7049 
>>>>>> <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, a CBOR tag is currently included into the file (a list 
>>>>>> of cbor tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about the CBOR "magic header". Thanks a lot Luke for 
>>>>>> this great research. Chris, if you agree, I can add support for 
>>>>>> prepending the self-describing CBOR tag 55799 to the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>>> CBORGenerator class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610 
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the 
>>>>>> original file extension is used. We could add support for a 
>>>>>> -extension command-line option allowing the user to give a file 
>>>>>> extension (e.g., cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
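For readers following the hex-dump description above, a CBOR initial byte splits into a 3-bit major type and 5 bits of additional information (RFC 7049, section 2.1). The sketch below is illustrative, and the helper name `split_initial_byte` is hypothetical. Note that the bit pattern "011" followed by "11010" corresponds to the byte 0x7a, whereas 0x7f decodes as major type 3 with additional information 31, which the RFC uses for indefinite-length strings; which of the two is in the dumped files would need checking against an actual hex dump.

```python
from typing import Tuple

# Major type names as listed in RFC 7049, section 2.1.
MAJOR_TYPES = {
    0: "unsigned integer",
    1: "negative integer",
    2: "byte string",
    3: "text string",
    4: "array",
    5: "map",
    6: "tag",
    7: "simple value / float",
}


def split_initial_byte(b: int) -> Tuple[int, int]:
    """Split a CBOR initial byte into (major type, additional information)."""
    return b >> 5, b & 0x1F


for byte in (0x7A, 0x7F, 0xD9):
    major, info = split_initial_byte(byte)
    print(hex(byte), MAJOR_TYPES[major], info)
```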
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <[email protected]>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe 
>>>>>> U (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <[email protected]>,
>>>>>> "[email protected]"
>>>>>> <[email protected]>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I did a bit of research on CBOR detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written into the CBOR file, through which other applications can 
>>>>>>> identify the CBOR type....
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag in the CBOR file dumped by the Nutch tool, 
>>>>>>> though I am not very sure.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
>>>>>>> Polar CyberInfrastructure DR Students'; 
>>>>>>> [email protected]
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an 
>>>>>>> issue in Tika to make it support CBOR, then yes, by all means! 
>>>>>>> :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> [email protected]
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <[email protected]>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <[email protected]>,
>>>>>>> <[email protected]>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting a 
>>>>>>>> CBOR parser and detection.
>>>>>>>> I checked the RFC, and it looks like CBOR has no magic numbers.
>>>>>>>> PFA:
>>>>>>>> rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the 
>>>>>>>> CommonCrawlDataDumper is not dumping the Nutch segments with 
>>>>>>>> the .cbor extension, which would be helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries for CBOR in tika-mimetypes.xml (PFA: 
>>>>>>>> cbor_tika.mimetypes.xml), those entries do not properly detect 
>>>>>>>> the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does 
>>>>>>>> not have magic bytes; off the top of my head, the only way we 
>>>>>>>> can detect it is using the extension and the content byte 
>>>>>>>> histogram (please note, this is a locally optimal, 
>>>>>>>> data-dependent solution.)  :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main topic of this 
>>>>>>>> thread, i.e. the plan for testing the "probabilistic mime 
>>>>>>>> detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic 
>>>>>>>> selection feature, replace the Tika jar in Nutch with the 
>>>>>>>> repackaged one, and then run the CommonCrawlDataDumper and see 
>>>>>>>> how it goes. If you have any specific ideas or thoughts on the 
>>>>>>>> testing, please kindly let me know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; 
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 19 Apr 2015, at 12:11, Luke liu <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; sorry for the brief delay, I was 
>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to JSON files with human-readable 
>>>>>>>> content.
>>>>>>>> 1) I am trying to run one of the commands shown in 
>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end 
>>>>>>>> with .html or .htm. The commands listed in gen-common-crawl.sh 
>>>>>>>> seem to allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. 
>>>>>>>> Although the locations are not exactly correct (they probably 
>>>>>>>> need to be updated), part of the patterns allowed me to locate 
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). 
>>>>>>>> Again I am seeing that the dumped files all end with html, but 
>>>>>>>> surprisingly the contents inside those HTML files are in JSON 
>>>>>>>> format;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as that of the 
>>>>>>>> original file.
>>>>>>>> More in detail, using the -epochFilename command-line option 
>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored 
>>>>>>>> with a filename of the format 
>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>> where <filetype> is either the extension of the original file 
>>>>>>>> or .html as default if the original file does not have an 
>>>>>>>> extension.
>>>>>>>> This scheme is used for file naming and it does not depend on 
>>>>>>>> the internal output format (JSON).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object starts with some 
>>>>>>>> garbled chars in each of the outputted JSON files (those with 
>>>>>>>> the html extension), PFA: garbled.jpg; one of the outputted 
>>>>>>>> JSON files is also attached as an example (PFA: 
>>>>>>>> 1423894754000.html). The JSON files cannot be parsed properly 
>>>>>>>> by aggregate.py due to those garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no 
>>>>>>>> mimeTypes element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary 
>>>>>>>> data are stored in a structured document format (JSON). 
>>>>>>>> Furthermore, this document is encoded using CBOR 
>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable 
>>>>>>>> character that you notice at the front and at the end of the 
>>>>>>>> JSON data is due to the CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out 
>>>>>>>> by CommonCrawlDataDumper, you have to deserialize the 
>>>>>>>> CBOR-encoded data structure inside the file.
>>>>>>>> 
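As a rough illustration of the deserialization step described above, the following standard-library sketch unwraps a single CBOR text string and parses the JSON inside it. It assumes, as stated elsewhere in this thread, that each dumped file wraps one string value holding the JSON text; the function names are hypothetical, only text strings (major type 3) are handled, and a real CBOR library would be the practical choice for anything beyond a quick look.

```python
import json

SELF_DESCRIBE = b"\xd9\xd9\xf7"  # optional tag 55799 prefix (RFC 7049, 2.4.5)


def _read_length(data: bytes, pos: int, info: int):
    """Return (length, next position) for a definite-length header."""
    if info < 24:
        return info, pos  # the length is stored in the initial byte itself
    sizes = {24: 1, 25: 2, 26: 4, 27: 8}
    if info not in sizes:
        raise ValueError("unsupported additional information: %d" % info)
    n = sizes[info]
    return int.from_bytes(data[pos:pos + n], "big"), pos + n


def decode_text_item(data: bytes) -> str:
    """Decode one CBOR text string, definite or indefinite length."""
    pos = len(SELF_DESCRIBE) if data.startswith(SELF_DESCRIBE) else 0
    major, info = data[pos] >> 5, data[pos] & 0x1F
    pos += 1
    if major != 3:
        raise ValueError("expected a CBOR text string (major type 3)")
    if info == 31:
        # Indefinite length: definite-length chunks until the 0xff break.
        chunks = []
        while data[pos] != 0xFF:
            cmajor, cinfo = data[pos] >> 5, data[pos] & 0x1F
            if cmajor != 3 or cinfo == 31:
                raise ValueError("invalid chunk in indefinite-length string")
            length, pos = _read_length(data, pos + 1, cinfo)
            chunks.append(data[pos:pos + length])
            pos += length
        return b"".join(chunks).decode("utf-8")
    length, pos = _read_length(data, pos, info)
    return data[pos:pos + length].decode("utf-8")


# Hypothetical payload standing in for a dumped document.
payload = '{"url": "http://example.org", "status": 200}'.encode("utf-8")
encoded = b"\x7a" + len(payload).to_bytes(4, "big") + payload
print(json.loads(decode_text_item(encoded)))
```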
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job on detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do not 
>>>>>>>> hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not 
>>>>>>>> collected and computed by aggregate.py from those output 
>>>>>>>> JSON files; it looks like it comes from the log.... See 
>>>>>>>> the following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
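If the statistics are indeed taken from the log, the per-type counts could be collected with something like the sketch below. It assumes the entries appear one per line in the shape shown in the excerpt above; the function name `parse_stats` is hypothetical and this is not what aggregate.py does.

```python
import json


def parse_stats(lines):
    """Collect {mimeType: count} from stats lines of a dumper log."""
    counts = {}
    for line in lines:
        line = line.strip().rstrip(",")
        if not line.startswith("{"):
            continue  # skip timestamps, headers, and the [ ] brackets
        entry = json.loads(line)
        counts[entry["mimeType"]] = int(entry["count"])
    return counts


# Three entries copied from the log excerpt above.
log_excerpt = """TOTAL Stats:
[
 {"mimeType":"application/xhtml+xml","count":"3000"}
 {"mimeType":"text/html","count":"9506"}
 {"mimeType":"application/pdf","count":"280"}
]
"""

stats = parse_stats(log_excerpt.splitlines())
print(stats)
print(sum(stats.values()))
```

Running the same parser over the logs of a stock Tika build and a repackaged one would give two dictionaries that can be compared type by type.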
>>>>>>>> It turns out that aggregate.py is not the one that produces the 
>>>>>>>> statistical information; I am not sure what it does... Anyway, I 
>>>>>>>> think I understand the whole idea and I concur with it. Maybe 
>>>>>>>> we can repackage Tika with the feature (i.e. probabilistic 
>>>>>>>> mime selection) incorporated and see whether it outputs the 
>>>>>>>> same information in the log as the version without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic mime selection 
>>>>>>>> feature: in my pull request, I added a simple test case which 
>>>>>>>> might tell a bit more about how the feature is called and used; 
>>>>>>>> it is simple though.
>>>>>>>> Here is an example snippet:
>>>>>>>>               ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>                   new ProbabilisticMimeDetectionSelector();
>>>>>>>>               // input is an InputStream, metadata is a Metadata
>>>>>>>>               probSel.detect(input, metadata);
>>>>>>>> It is similar to MimeTypes::detect(...) (more information on 
>>>>>>>> this can be found in
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, in order to allow Tika().detect() to call 
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as 
>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>> modify/add some code in TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to remove the MimeTypes detector 
>>>>>>>> (mimeTypes) from the list and replace it with the 
>>>>>>>> ProbabilisticMimeDetectionSelector (probSel). (I am not sure 
>>>>>>>> whether I should create another pull request with this change 
>>>>>>>> for TikaConfig.)
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, findings, and 
>>>>>>>> plan. If you have anything you would like to add or comment 
>>>>>>>> on, please kindly let me know; then I will start working on my 
>>>>>>>> 'finale'. BTW, don't worry, my graduation is not the end of my 
>>>>>>>> work with Tika and this project. After graduating I still can 
>>>>>>>> and want to help this polar project and Tika as much as 
>>>>>>>> possible, correct programming faults and bugs, respond to 
>>>>>>>> Tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>> integrated into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> [email protected]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <[email protected]>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>>>>>>>> <[email protected]>
>>>>>>>> Cc: 'Luke' <[email protected]>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research. Last week I 
>>>>>>>> finished the last item in the list, and hopefully everything 
>>>>>>>> will be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or 
>>>>>>>> optimizing the code, since the majority of the research has 
>>>>>>>> been done... It would also be great if you could comment on my 
>>>>>>>> work (the 2 pull requests) when you have time.
>>>>>>>> 
>>>>>>>> If anything in my work is confusing, please also do let me 
>>>>>>>> know.
>>>>>>>> 
>>>>>>>> Thanks, and I am glad to be working with you. For the next 
>>>>>>>> couple of weeks before graduation, I am going to continue 
>>>>>>>> revising and testing the code and features to get rid of any 
>>>>>>>> flaws when I have time.
>>>>>>>> 
>>>>>>>> I am not sure if I missed something; if I did miss something 
>>>>>>>> important, please do let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>> it, send an email to [email protected]
>>>>>>>> <mailto:memex-jpl%[email protected]>.
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
> 
