Thanks, Luke. This is probably a good opportunity to test out your Bayesian MIME detector. How does it perform here?
Sent from my iPhone > On Apr 22, 2015, at 3:29 PM, Luke <[email protected]> wrote: > > Hi professor, > > Please see the following results. > <match value="<html xmlns=" type="string" offset="0:1024"/> > Result: "text/html" > > <match value="<html xmlns=" type="string" offset="0:6000"/> > Result: "application/xhtml+xml" > > > Thanks > Luke > > -----Original Message----- > From: Chris Mattmann [mailto:[email protected]] > Sent: Wednesday, April 22, 2015 4:21 AM > To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; > [email protected] > Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; > 'NSF Polar CyberInfrastructure DR Students'; [email protected] > Subject: Re: [memex-jpl] this week action from luke > > Hi Luke, > > Actually I just meant go into tika-mimetypes.xml and change the magic offsets > for application/xhtml+xml and see if that works. The code you changed below > is actually how many bytes Tika will first download to do MIME checking. > > Cheers, > Chris > > ------------------------ > Chris Mattmann > [email protected] > > > > > -----Original Message----- > From: Luke <[email protected]> > Date: Wednesday, April 22, 2015 at 2:25 AM > To: Chris Mattmann <[email protected]>, Chris Mattmann > <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'" > <[email protected]>, <[email protected]> > Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, "'Zimdars, > Paul A (3980-Affiliate)'" <[email protected]>, NSF Polar > CyberInfrastructure DR Students <[email protected]>, > <[email protected]> > Subject: RE: [memex-jpl] this week action from luke > >> >> Hi professor, >> >> I just tried it with minLength set to 1024, I get the following >> "text/plain" >> I am a bit surprised.... >> >> BTW, the 6000 min length still give "application/xhtml+xml"; with >> anything below 1024 min length, I am seeing "text/plain". 
:) >> >> BTW, the min length I am referring/altering is as follows >> MimeTypes.java >> public int getMinLength() { >> // This needs to be reasonably large to be able to correctly >> detect >> // things like XML root elements after initial comment and DTDs >> return 64 * 1024; >> } >> >> >> Thanks >> Luke >> >> -----Original Message----- >> From: Chris Mattmann [mailto:[email protected]] >> Sent: Tuesday, April 21, 2015 7:48 PM >> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U >> (3980-Affiliate)'; [email protected] >> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >> [email protected] >> Subject: Re: [memex-jpl] this week action from luke >> >> Thanks Luke. >> >> So I guess all I was asking was could you try it out. Thanks for the >> lesson in the RFC. >> >> Cheers, >> Chris >> >> ------------------------ >> Chris Mattmann >> [email protected] >> >> >> >> >> -----Original Message----- >> From: Luke <[email protected]> >> Date: Wednesday, April 22, 2015 at 1:46 AM >> To: Chris Mattmann <[email protected]>, Chris Mattmann >> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'" >> <[email protected]>, <[email protected]> >> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, >> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, NSF >> Polar CyberInfrastructure DR Students >> <[email protected]>, >> <[email protected]> >> Subject: RE: [memex-jpl] this week action from luke >> >>> Hi professor, >>> >>> >>> I think it highly depends on the content being read by tika, e.g. if >>> there is a sequence of bytes in the file that is being read and is the >>> same as one or more of mime types being defined in our tika-mimes.xml, >>> I guess that tika will put those types in its estimation list, please >>> note there could be multiple estimated mime types by magic-byte >>> detection approach. 
Now tika also considers the decision made by >>> extension detection approach, if extension says the file type it >>> believes is the first one in the magic type estimation list, then >>> certainly the first one will be returned. (the same applies to >>> metadata hint approach); Of course, tika also prefers the type that is >>> the most specialized. >>> >>> let's get back to the following question, here is my guess though. >>> [Prof]: Also what happens if you tweak the definition of XHTML to not >>> scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then? >>> Let's consider an extreme case where we only scan 10 or 1 bytes, then >>> it seems that magic bytes will inevitable detect nothing, and I think >>> it will return the something like" application/oct-stream" that is the >>> most general type. As mentioned, tika favours the one that is the most >>> specialized, if extension approach returns the one that is more >>> specialized, in this extreme case I believe almost every type is a >>> subclass of this "application/oct-stream".... therefore the answer in >>> this extreme may be yes, I think it is very possible that CBOR type >>> detected by the extension approach takes over in this case... >>> >>> My idea was and still is that if the cbor self-Describing tag 55799 is >>> present in the cbor file, then that can be used to detect the cbor type. >>> Again, the cbor type will probably be appended into the magic >>> estimation list together with another one such as application/html, I >>> guess the order in the list probably also matters, the first one is >>> preferred over the next one. Also the decision from the extension >>> detection approach also play the role the break the tie. >>> e.g. if extension detection method agrees on cbor with one of the >>> estimated type in the magic list, then cbor will be returned. (again, >>> same thing applies to metadatahint method). 
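
The tag 55799 idea above comes down to a fixed three-byte prefix check. A minimal sketch, assuming the serialized form 0xD9 0xD9 0xF7 given in RFC 7049 section 2.4.5; the function name is hypothetical and this is not Tika's detector code:

```python
# Serialized form of the CBOR self-describe tag 55799 (RFC 7049, section 2.4.5).
CBOR_SELF_DESCRIBE_TAG = bytes([0xD9, 0xD9, 0xF7])


def has_self_describe_tag(data: bytes) -> bool:
    """Return True if the buffer starts with the tag 55799 prefix."""
    return data[:3] == CBOR_SELF_DESCRIBE_TAG


# 0x63 introduces a 3-byte text string, here "abc".
tagged = CBOR_SELF_DESCRIBE_TAG + b"\x63abc"
untagged = b"\x63abc"

print(has_self_describe_tag(tagged))    # True
print(has_self_describe_tag(untagged))  # False
```

A file dumped without the tag fails this check, which is why the prefix only helps once the writer actually emits it.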
>>> >>> I have not taken a closer look at a cbor file that has the tag 55799, >>> but I expect to see its hex is something like 0xd9d9f7 or the tag >>> should be present in the header with a fixed sequence of >>> bytes(https://tools.ietf.org/html/rfc7049#section-2.4.5 ), if this is >>> present in the file or preferable in the header (within a reasonable >>> range of bytes ), I believe it can probably be used as the magic >>> numbers for the cbor type. >>> >>> >>> There is another thing I have mentioned in the jira ticket I opened >>> yesterday against the cbor parser and detection, it is also possible >>> that cbor content can be imbedded inside a plain json file, the way >>> that a decoder can distinguish them in that file is by looking at the >>> tag 55799 again. This may rarely happen but a robust parser might be >>> able to take care of that, tika might need to consider the use of >>> fastXML being used by the nutch tool when developing the cbor parser... >>> Again let me cite the same paragraph from the rfc, >>> >>> " a decoder might be able to parse both CBOR and JSON. >>> Such a decoder would need to mechanically distinguish the two >>> formats. An easy way for an encoder to help the decoder would be to >>> tag the entire CBOR item with tag 55799, the serialization of which >>> will never be found at the beginning of a JSON text." >>> >>> >>> Thanks >>> Luke >>> >>> >>> >>> -----Original Message----- >>> From: Mattmann, Chris A (3980) [mailto:[email protected]] >>> Sent: Tuesday, April 21, 2015 9:49 PM >>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate) >>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); >>> 'NSF Polar CyberInfrastructure DR Students'; >>> [email protected] >>> Subject: Re: [memex-jpl] this week action from luke >>> >>> Hi Luke, >>> >>> Can you post the below conversation to dev@tika and summarize it there. 
>>> Also what happens if you tweak the definition of XHTML to not scan >>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then? >>> >>> Cheers, >>> Chris >>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> Chris Mattmann, Ph.D. >>> Chief Architect >>> Instrument Software and Science Data Systems Section (398) NASA Jet >>> Propulsion Laboratory Pasadena, CA 91109 USA >>> Office: 168-519, Mailstop: 168-527 >>> Email: [email protected] >>> WWW: http://sunset.usc.edu/~mattmann/ >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> Adjunct Associate Professor, Computer Science Department University of >>> Southern California, Los Angeles, CA 90089 USA >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> >>> >>> >>> >>> >>> >>> -----Original Message----- >>> From: Luke <[email protected]> >>> Date: Wednesday, April 22, 2015 at 12:19 AM >>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U >>> (3980-Affiliate)" <[email protected]>, Chris Mattmann >>> <[email protected]> >>> Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>, >>> "Zimdars, Paul A (3980-Affiliate)" <[email protected]>, NSF >>> Polar CyberInfrastructure DR Students >>> <[email protected]>, >>> "[email protected]" <[email protected]> >>> Subject: RE: [memex-jpl] this week action from luke >>> >>>> Hi Professor, >>>> Please see attached jpg for the difference. 
>>>> Thanks >>>> Luke >>>> >>>> -----Original Message----- >>>> From: Chris Mattmann [mailto:[email protected]] >>>> Sent: Tuesday, April 21, 2015 5:27 PM >>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)' >>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >>>> [email protected] >>>> Subject: Re: [memex-jpl] this week action from luke >>>> >>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m >>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m >>>> < /path/to/cbor/file.cbor any difference? >>>> >>>> ------------------------ >>>> Chris Mattmann >>>> [email protected] >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Luke <[email protected]> >>>> Date: Tuesday, April 21, 2015 at 5:41 PM >>>> To: 'Luke' <[email protected]>, Chris Mattmann >>>> <[email protected]>, 'Giuseppe Totaro' >>>> <[email protected]>, Chris Mattmann >>>> <[email protected]> >>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, >>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, >>>> NSF Polar CyberInfrastructure DR Students >>>> <[email protected]>, >>>> <[email protected]> >>>> Subject: RE: [memex-jpl] this week action from luke >>>> >>>>> Hi professor, >>>>> I just sent a pull request for adding cbor extension. >>>>> The interesting thing is that tika is still identifying the file >>>>> dumped by the nutch dump tool as a "application/xhtml+xml" even when >>>>> I manually change the file extension to the correct one (i.e. *.cbor ). 
>>>>> >>>>> The reason is probably that tika is identifying "application/xhtml+xml" >>>>> by searching for the "<html" in the file content, PFA: >>>>> xhtml+xml.jpg; Now if you take a look at the cbor file dumped by >>>>> xhtml+nutch, >>>>> you see that we do have that element as part of the cbor content >>>>> because the entire crawled xhtml document seems to be imbedded in >>>>> the cbor json(PFA: >>>>> cbor.jpg); and also in Tika, the magic detection seems to have >>>>> higher priority over the glob detection, thus the type is being >>>>> incorrectly detected. >>>>> >>>>> Therefore, I would like to please mention that adding the entry of >>>>> <glob pattern="*.cbor"/> is not resolving the issue as of now >>>>> without some fixed magic bytes / patterns for cbor. >>>>> I also would like to add that the thing will be different with our >>>>> probabilistic mime detection selector, because if we know that the >>>>> file extension is more reliable than magic bytes, then we can >>>>> certainly add more preferential weight to the extension... this also >>>>> might show the current implementation with MimeTypes detection is a >>>>> bit stiff or less flexible in this scneario. :) >>>>> >>>>> >>>>> Thanks >>>>> Luke >>>>> >>>>> -----Original Message----- >>>>> From: Luke [mailto:[email protected]] >>>>> Sent: Tuesday, April 21, 2015 12:14 PM >>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)' >>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A >>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; >>>>> '[email protected]' >>>>> Subject: RE: [memex-jpl] this week action from luke >>>>> >>>>> Yes, let me add the cbor extension entry in tika xml, will send the >>>>> pull request soon. 
>>>>> >>>>> Thanks >>>>> Luke >>>>> -----Original Message----- >>>>> From: Chris Mattmann [mailto:[email protected]] >>>>> Sent: Tuesday, April 21, 2015 6:51 AM >>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980) >>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A >>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; >>>>> [email protected] >>>>> Subject: Re: [memex-jpl] this week action from luke >>>>> >>>>> Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER >>>>> and tag along with adding an -extension command would be fantastic. >>>>> Can you file both of those NUTCH issues, wait a day or so, and then >>>>> based on feedback use your new Nutch commit karma to get those into >>>>> Nutch? >>>>> >>>>> And then when creating the issues, can you link to the TIKA-1610 issue? >>>>> At that point, when those two to be defined NUTCH issues are up, >>>>> Luke, in parallel can you throw up a pull request/patch in Tika for >>>>> the extension along with the MIME detection? >>>>> >>>>> Cheers, >>>>> Chris >>>>> >>>>> ------------------------ >>>>> Chris Mattmann >>>>> [email protected] >>>>> >>>>> >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: Giuseppe Totaro <[email protected]> >>>>> Date: Tuesday, April 21, 2015 at 12:33 PM >>>>> To: Chris Mattmann <[email protected]> >>>>> Cc: Luke <[email protected]>, Chris Mattmann >>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)" >>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)" >>>>> <[email protected]>, NSF Polar CyberInfrastructure DR >>>>> Students <[email protected]>, >>>>> "[email protected]" >>>>> <[email protected]> >>>>> Subject: Re: [memex-jpl] this week action from luke >>>>> >>>>>> Thanks Luke. Great work. >>>>>> Chris, we wrap a single string value, representing the JSON text, >>>>>> for each file into CBOR (by using serializeCBORData method). 
For >>>>>> instance, using the Unix hex dump tool, we can see that, as >>>>>> expected, the first byte of all files is "0x7F" (the first three >>>>>> bits are "011", that is the major type for strings, and the >>>>>> following 5 bits are "11010", meaning a uint32_t encodes the length >>>>>> of following text), and the following 4 bytes (single-precision >>>>>> float) encodes the right length of file (as described in RFC7049 >>>>>> <http://tools.ietf.org/html/rfc7049>). >>>>>> Therefore, a CBOR tag is currently included into the file (a list >>>>>> of cbor tags is available here >>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>). >>>>>> I did not know about CBOR "magic header". Thanks a lot Luke for >>>>>> this great research. Chris, if you agree, I can add support for >>>>>> prepending self-describing CBOR tag 55799 to CommonCrawldataDumper >>>>>> class. I believe it is very easy because I have to enable the >>>>>> WRITE_TYPE_HEADER feature for CBORGenerator class (the source code >>>>>> is available here >>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/s >>>>>> r >>>>>> c >>>>>> / >>>>>> m ain >>>>>> /java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>). >>>>>> Then, I can comment the TIKA-1610 >>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue. >>>>>> >>>>>> Regarding the file extension, in the Memex CCA format the original >>>>>> file extension is used. We could add support for a -extension >>>>>> command-line option allowing the user to give a file extension >>>>>> (e.g., >>>>>> cbor) for all files dumped out. >>>>>> >>>>>> Thanks a lot, >>>>>> Giuseppe >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) >>>>>> <[email protected]> wrote: >>>>>> >>>>>> Thanks for this great research, Luke! >>>>>> >>>>>> Giuseppe, any idea why this tag doesn't make it into the file? 
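
For reference, the initial-byte layout described in Giuseppe's message can be checked mechanically. The sketch below assumes the RFC 7049 layout (top three bits are the major type, low five bits are the additional information); the helper names are hypothetical. Note that bits "011" followed by "11010" correspond to the byte 0x7A, whose 4-byte length field is an unsigned integer, while 0x7F would mark an indefinite-length text string.

```python
def split_initial_byte(b: int) -> tuple:
    """Split a CBOR initial byte into (major type, additional info)."""
    return b >> 5, b & 0x1F


def text_length(data: bytes):
    """Return the declared length of a CBOR text string, or None if indefinite."""
    major, info = split_initial_byte(data[0])
    if major != 3:
        raise ValueError("not a CBOR text string")
    if info < 24:
        return info                      # length stored in the initial byte
    if 24 <= info <= 27:
        width = 1 << (info - 24)         # 1, 2, 4 or 8 length bytes follow
        return int.from_bytes(data[1:1 + width], "big")
    if info == 31:
        return None                      # indefinite-length (chunked) string
    raise ValueError("reserved additional-info value")


for value in (0x7A, 0x7F):
    print(hex(value), split_initial_byte(value))
```

Running a check like this against the first bytes of a dumped file would show which of the two encodings the dump tool actually wrote.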
>>>>>> >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> Chris Mattmann, Ph.D. >>>>>> Chief Architect >>>>>> Instrument Software and Science Data Systems Section (398) NASA Jet >>>>>> Propulsion Laboratory Pasadena, CA 91109 USA >>>>>> Office: 168-519, Mailstop: 168-527 >>>>>> Email: [email protected] >>>>>> WWW: http://sunset.usc.edu/~mattmann/ >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> Adjunct Associate Professor, Computer Science Department University >>>>>> of Southern California, Los Angeles, CA 90089 USA >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: Luke <[email protected]> >>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM >>>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U >>>>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann >>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)" >>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)" >>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR >>>>>> Students <[email protected]>, >>>>>> "[email protected]" >>>>>> <[email protected]> >>>>>> Subject: RE: [memex-jpl] this week action from luke >>>>>> >>>>>>> Thanks professor. >>>>>>> Hi professor and all. >>>>>>> JIRA issue : CBOR Parser and detection improvement >>>>>>> https://issues.apache.org/jira/browse/TIKA-1610 >>>>>>> >>>>>>> I tried to conduct a bit research with this cbor detection. >>>>>>> >>>>>>> It looks like there is a self describing tag that needs to be >>>>>>> written in the cbor file thru which other applications might be >>>>>>> able to identify the cbor type.... >>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5 >>>>>>> >>>>>>> I don't see that tag being present in the cbor file dumped by the >>>>>>> nutch tool, I am not very sure though. 
>>>>>>> >>>>>>> Thanks >>>>>>> Luke >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Chris Mattmann [mailto:[email protected]] >>>>>>> Sent: Monday, April 20, 2015 4:10 AM >>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C >>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar >>>>>>> CyberInfrastructure DR Students'; [email protected] >>>>>>> Subject: Re: [memex-jpl] this week action from luke >>>>>>> >>>>>>> Nice one, Luke. If you have a second and you can open up an issue >>>>>>> in Tika to make it support CBOR, then yes, by all means! :) >>>>>>> >>>>>>> >>>>>>> ------------------------ >>>>>>> Chris Mattmann >>>>>>> [email protected] >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Luke <[email protected]> >>>>>>> Date: Monday, April 20, 2015 at 4:15 AM >>>>>>> To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann >>>>>>> <[email protected]>, Chris Mattmann >>>>>>> <[email protected]>, "'Bryant, Ann C (398G-Affiliate)'" >>>>>>> <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'" >>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR >>>>>>> Students <[email protected]>, >>>>>>> <[email protected]> >>>>>>> Subject: RE: [memex-jpl] this week action from luke >>>>>>> >>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit >>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated. >>>>>>>> >>>>>>>> BTW, it looks like Tika might need to consider the support with >>>>>>>> COBR parser and detection. >>>>>>>> I checked the rfc, it looks like CBOR has not got magic numbers. >>>>>>>> PFA: >>>>>>>> rfc_cbor.jpg >>>>>>>> Actually, I don't quite understand why the CommonCrawlDataDumper >>>>>>>> is not dumping the nutch segments with the .cbor extension, which >>>>>>>> seems to be helpful for type detection. 
>>>>>>>> >>>>>>>> To professor Mattmann, >>>>>>>> Tika does not support the detection of COBR, although the trunk >>>>>>>> version has the entries (PFA: cbor_tika.mimetypes.xml)for cbor in >>>>>>>> the tika-mimetypes.xml, those entries are not detecting properly >>>>>>>> the cobr files dumped by CommonCrawlDataDumper. Also CBOR does >>>>>>>> not have magic bytes, off the top of my head the only way we can >>>>>>>> detect it is using the extension, and content byte histogram >>>>>>>> (please note, this is a local optimal solution and >>>>>>>> data-dependent.) J >>>>>>>> >>>>>>>> I think I am bit deviating from the main route and discussion of >>>>>>>> this thread.... i.e. the plan for testing the "probabilistic mime >>>>>>>> detector selection" with polar data. >>>>>>>> Anyway, I plan to repackage tika by incorporating the >>>>>>>> probabilistic selection feature and replace the tika jar in nutch >>>>>>>> with the repackaged one, and then run the CommonCrawlDataDumper >>>>>>>> and see how it goes. If you have any specific ideas and thought >>>>>>>> with the testing, please kindly let me know. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Luke >>>>>>>> >>>>>>>> From: Giuseppe Totaro [mailto:[email protected]] >>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM >>>>>>>> To: Luke liu >>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C >>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF >>>>>>>> Polar CyberInfrastructure DR Students; [email protected] >>>>>>>> Subject: Re: [memex-jpl] this week action from luke >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi Luke, >>>>>>>> >>>>>>>> >>>>>>>> my name is Giuseppe and I am a PhD student working under the >>>>>>>> supervision of Prof. Chris Mattmann. I worked on >>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a >>>>>>>> couple of your observations. My comments inline below. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu >>>>>>>> <[email protected]> ha >>>>>>>> scritto: >>>>>>>> >>>>>>>> >>>>>>>> Thanks a lot professor; Sorry for the brief delay, I was spending >>>>>>>> some time in understanding the code repo i.e. >>>>>>>> http://github.com/chrismattmann/trec-dd-polar/ >>>>>>>> >>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is >>>>>>>> dumping the crawl segments to json files with the human readable >>>>>>>> and understandable content. >>>>>>>> 1) I am trying to run one of the commands on my side as shown in >>>>>>>> gen-common-crawl.sh, but the generated files all end with .html >>>>>>>> or htm; The command listed in gen-common-crawl.sh seems to be >>>>>>>> allude to where the data is located on our >>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org> >>>>>>>> <http://nsfpolardata.dyndns.org/>; although the locations are not >>>>>>>> exactly correct (probably they need to be updated), part of the >>>>>>>> patterns was able to allow me to locate some similar datasets (e.g. >>>>>>>> /data2/crawls/raw/CS572Spring2015 ) again I am seeing the dumped >>>>>>>> files are all ending with html, but surprisingly inside those >>>>>>>> outputted html files, the contents are present in json format; >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The file extension is (almost) always the same as the original file. >>>>>>>> More in detail, using the -epochFilename command-line option (as >>>>>>>> in gen-common-crawl.sh), the scraped data will be stored with a >>>>>>>> filename of the format <epochtime(milliseconds)>.<filetype>, >>>>>>>> where <filetype> is either the extension of the original file or >>>>>>>> .html as default if the original file does not have an extension. >>>>>>>> This schema is used for file naming and it does not depend on >>>>>>>> internal output format (JSON). 
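
The naming rule described above can be sketched as follows. This is a simplified illustration, not the Nutch implementation: the function name is hypothetical, and it assumes the extension is simply whatever follows the last dot of the original file name.

```python
import os


def dump_filename(epoch_millis: int, original_name: str) -> str:
    """Build <epochtime(milliseconds)>.<filetype>, defaulting to .html."""
    extension = os.path.splitext(original_name)[1]  # includes the dot, or ""
    return f"{epoch_millis}{extension or '.html'}"


print(dump_filename(1423894754000, "report.pdf"))  # 1423894754000.pdf
print(dump_filename(1423894754000, "index"))       # 1423894754000.html
```

Under this rule the extension says nothing about the CBOR-encoded JSON inside the file, which matches the .html files observed earlier in the thread.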
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 2) Another problem is that the root object is being set with some >>>>>>>> garbled chars in each of the outputted json files (with extension >>>>>>>> html in the end), PFA: garbled.jpg and one of the outputted json >>>>>>>> file has been also attached as an example too (PFA: >>>>>>>> 1423894754000.html); the json files cannot be parsed properly by >>>>>>>> aggregate.py due to those garbled chars. >>>>>>>> Even if I get rid of those garbled chars, there are not mimeTypes >>>>>>>> element which are being read by aggregate.py. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Text content and metadata extracted from the crawled binary data >>>>>>>> are stored in a structured document format (JSON). Furthermore, >>>>>>>> this document is encoded using CBOR <http://cbor.io/> >>>>>>>> serialization. Each not human-readable character that you notice >>>>>>>> in front and at the end of JSON data is due to CBOR-encoding. >>>>>>>> Thus, if you need to read JSON data from document dumped out by >>>>>>>> CommonCrawlDataDumper, you have to deserialized the CBOR-encoded >>>>>>>> data structure inside the file. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I hope this short overview can help in you work. I really >>>>>>>> appreciate your feedback and, by the way, thanks a lot for your >>>>>>>> great job in detection. >>>>>>>> >>>>>>>> I am available to provide you all support I can give, so you do >>>>>>>> not hesitate to contact me if you may need any further information. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Giuseppe >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Finally, after some research, I guess that the statistical >>>>>>>> information (present in the readme of the code repo) is not being >>>>>>>> collected and computed by aggregate.py from those output json >>>>>>>> files but it looks like it is coming from the log.... 
see the
>>>>>>>> following as an example:
>>>>>>>>
>>>>>>>> 2015-04-19 04:55:42,078 INFO tools.CommonCrawlDataDumper -
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>> {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>> {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>> {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>> {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>> {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>> {"mimeType":"application/zip","count":"6"}
>>>>>>>> {"mimeType":"application/xml","count":"11"}
>>>>>>>> {"mimeType":"image/png","count":"110"}
>>>>>>>> {"mimeType":"image/jpeg","count":"70"}
>>>>>>>> {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>> {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>> {"mimeType":"video/mp4","count":"3"}
>>>>>>>> {"mimeType":"text/plain","count":"104"}
>>>>>>>> {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>> {"mimeType":"image/gif","count":"2"}
>>>>>>>> {"mimeType":"text/x-php","count":"1"}
>>>>>>>> {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>> {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>> {"mimeType":"text/html","count":"9506"}
>>>>>>>> {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>>
>>>>>>>> It turns out that aggregate.py is not the one that produces the
>>>>>>>> statistical information, not sure what it does... but anyway, I
>>>>>>>> think I understand the whole idea and I do concur with it, might
>>>>>>>> be we can repackage the tika by incorporating the feature (i.e.
>>>>>>>> probabilistic mime
>>>>>>>> selection) in it and see if it can output the same information as
>>>>>>>> the one without it in the log.
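
To compare the log output with and without the probabilistic selector, the per-type counts can be pulled out of the log text directly. A minimal sketch, assuming each entry sits on its own line in the form shown in the log excerpt above; the function name is hypothetical and the sample below uses only three of the logged entries:

```python
import json


def parse_mime_stats(log_text: str) -> dict:
    """Collect {"mimeType": ..., "count": ...} entries into a dict of ints."""
    counts = {}
    for line in log_text.splitlines():
        line = line.lstrip("> ").strip()   # tolerate email quote markers
        if line.startswith("{") and line.endswith("}"):
            entry = json.loads(line)
            counts[entry["mimeType"]] = int(entry["count"])
    return counts


sample = """TOTAL Stats:
[
{"mimeType":"application/xhtml+xml","count":"3000"}
{"mimeType":"text/html","count":"9506"}
{"mimeType":"application/pdf","count":"280"}
]"""

stats = parse_mime_stats(sample)
print(stats["application/xhtml+xml"])  # 3000
print(sum(stats.values()))             # 12786
```

Parsing two runs this way and comparing the resulting dictionaries would show which types gain or lose files when the detector changes.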
>>>>>>>> >>>>>>>> BTW, Regarding the use of the feature with probabilistic mime >>>>>>>> selection: >>>>>>>> in my pull request, I added a simple test case which might tell a >>>>>>>> bit more about how the feature is called and used, it is simple >>>>>>>> though. >>>>>>>> Here is an example snippet >>>>>>>> ProbabilisticMimeDetectionSelector probSel = new >>>>>>>> ProbabilisticMimeDetectionSelector(); >>>>>>>> probSel.detect(input::InputStream, metadata:: >>>>>>>> Metadata) It is similar to MimeTypes::detect(...) (more >>>>>>>> information with this can be found in >>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517) >>>>>>>> Now, in order to allow the Tika().detect() to call the >>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as >>>>>>>> Tika().detect() is being called by commoncrawldump), we need to >>>>>>>> modify/add some code in the TikaConfig which initializes a list >>>>>>>> of default detectors, and we need to get rid of the detector - >>>>>>>> mimeTypes:: >>>>>>>> MimeTypes in the list and replace it with probSel:: >>>>>>>> ProbabilisticMimeDetectionSelector. (not sure if I should create >>>>>>>> another pull request with this change for >>>>>>>> TikaConfig) >>>>>>>> >>>>>>>> I think that is all of my initial thought with some finding and >>>>>>>> plan; if you have anything you would like to please add and >>>>>>>> comment, please do kindly let me know, then I will start working >>>>>>>> on my 'finale'. BTW, don't worry, even after I am graduated, the >>>>>>>> graduation is not my termination with tika and this project, >>>>>>>> after then I still can and want to help this polar project and >>>>>>>> tika as much as possible, and correct the programming faults and >>>>>>>> bugs, respond to the tika issues ,etc. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> Luke >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Chris Mattmann [mailto:[email protected]] >>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM >>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C >>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate) >>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; >>>>>>>> [email protected] >>>>>>>> Subject: Re: this week action from luke >>>>>>>> Importance: High >>>>>>>> >>>>>>>> Awesome Luke. I am going to work specifically on now benchmarking >>>>>>>> your code in real situations. For example, it would be fantastic >>>>>>>> to now run your Bayesian MIME detector over the whole NSF TREC >>>>>>>> Dynamic Domain data for Polar described here: >>>>>>>> >>>>>>>> http://github.com/chrismattmann/trec-dd-polar/ >>>>>>>> >>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, and >>>>>>>> Annie can explain it, also CC'ed. >>>>>>>> >>>>>>>> Can we make that your goal for the next 2 weeks to actually test >>>>>>>> it and produce a real result over the whole TREC-DD data for >>>>>>>> Polar? My goal will be to get your code committed and integrated >>>>>>>> into Tika. >>>>>>>> The more you can write me a guide of how to build and test your >>>>>>>> code with Tika so I can get it committed the better. >>>>>>>> >>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is >>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's >>>>>>>> existing MIME detection approach. If folks have any Memex needs >>>>>>>> to try and test more accurate file identification with Tika, Luke >>>>>>>> is the guy to talk to and I have him for 2 more weeks. >>>>>>>> >>>>>>>> Thanks! 
>>>>>>>> >>>>>>>> Cheers, >>>>>>>> Chris >>>>>>>> >>>>>>>> ------------------------ >>>>>>>> Chris Mattmann >>>>>>>> [email protected] >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Luke liu <[email protected]> >>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM >>>>>>>> To: Chris Mattmann <[email protected]>, Chris Mattmann >>>>>>>> <[email protected]> >>>>>>>> Cc: 'Luke' <[email protected]> >>>>>>>> Subject: this week action from luke >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi Professor Mattmann, >>>>>>>> >>>>>>>> I think I am in the final phase of the research, and last week I >>>>>>>> finished the last item in the list, and hopefully everything will >>>>>>>> be fine. >>>>>>>> >>>>>>>> For now, i probably can spend some time in verifying or >>>>>>>> optimizing the codes, the majority of the research has been >>>>>>>> done...and it will be also great if you can please comment on my >>>>>>>> work (the 2 pull >>>>>>>> requests) when you have time. >>>>>>>> >>>>>>>> If you do have confusion with any of my work, please also do let >>>>>>>> me know. >>>>>>>> >>>>>>>> Thanks and I am glad working with you, for the next a couple of >>>>>>>> weeks before graduation, I am going to continue revising and >>>>>>>> testing the code and features to get rid of some flaws (if any >>>>>>>> )when I have time. >>>>>>>> >>>>>>>> Not sure if I miss out something, and if I do miss some thing >>>>>>>> important, please do let me know too. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Luke >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the >>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group. >>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>> send an email to [email protected] >>>>>>>> <mailto:memex-jpl%[email protected]>. >>>>>>>> To post to this group, send email to [email protected]. >>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl. 
