Hi professor,

Please see the following results.

<match value="<html xmlns=" type="string" offset="0:1024"/>
Result: "text/html"

<match value="<html xmlns=" type="string" offset="0:6000"/>
Result: "application/xhtml+xml"

Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:[email protected]]
Sent: Wednesday, April 22, 2015 4:21 AM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; [email protected]
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; [email protected]
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the magic offsets for application/xhtml+xml and see if that works. The code you changed below is actually how many bytes Tika will first download to do MIME checking.

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]

-----Original Message-----
From: Luke <[email protected]>
Date: Wednesday, April 22, 2015 at 2:25 AM
To: Chris Mattmann <[email protected]>, Chris Mattmann <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'" <[email protected]>, <[email protected]>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, NSF Polar CyberInfrastructure DR Students <[email protected]>, <[email protected]>
Subject: RE: [memex-jpl] this week action from luke

>
>Hi professor,
>
>I just tried it with minLength set to 1024, and I get the following:
>"text/plain"
>I am a bit surprised....
>
>BTW, the 6000 min length still gives "application/xhtml+xml"; with
>anything below 1024 min length, I am seeing "text/plain". :)
>
>BTW, the min length I am referring to/altering is as follows
>MimeTypes.java
>    public int getMinLength() {
>        // This needs to be reasonably large to be able to correctly detect
>        // things like XML root elements after initial comment and DTDs
>        return 64 * 1024;
>    }
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Chris Mattmann [mailto:[email protected]]
>Sent: Tuesday, April 21, 2015 7:48 PM
>To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U
>(3980-Affiliate)'; [email protected]
>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>[email protected]
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke.
>
>So I guess all I was asking was could you try it out. Thanks for the
>lesson in the RFC.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>[email protected]
>
>
>-----Original Message-----
>From: Luke <[email protected]>
>Date: Wednesday, April 22, 2015 at 1:46 AM
>To: Chris Mattmann <[email protected]>, Chris Mattmann
><[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
><[email protected]>, <[email protected]>
>Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>,
>"'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, NSF
>Polar CyberInfrastructure DR Students
><[email protected]>,
><[email protected]>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi professor,
>>
>>
>>I think it highly depends on the content being read by tika, e.g. if
>>there is a sequence of bytes in the file being read that matches the
>>magic of one or more of the mime types defined in our
>>tika-mimetypes.xml, I guess that tika will put those types in its
>>estimation list; please note there could be multiple estimated mime
>>types from the magic-byte detection approach. Now tika also considers
>>the decision made by the extension detection approach: if the
>>extension says the file type it believes in is the first one in the
>>magic type estimation list, then certainly the first one will be
>>returned (the same applies to the metadata hint approach). Of course,
>>tika also prefers the type that is the most specialized.
>>
>>Let's get back to the following question; here is my guess though.
>>[Prof]: Also what happens if you tweak the definition of XHTML to not
>>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>Let's consider an extreme case where we only scan 10 or 1 bytes; then
>>it seems that magic bytes will inevitably detect nothing, and I think
>>it will return something like "application/octet-stream", which is the
>>most general type. As mentioned, tika favours the one that is the most
>>specialized; if the extension approach returns one that is more
>>specialized (in this extreme case I believe almost every type is a
>>subclass of "application/octet-stream"), then the answer in this
>>extreme case may be yes: I think it is very possible that the CBOR
>>type detected by the extension approach takes over in this case...
>>
>>My idea was and still is that if the cbor self-describing tag 55799 is
>>present in the cbor file, then that can be used to detect the cbor type.
>>Again, the cbor type will probably be appended to the magic
>>estimation list together with another one such as application/html; I
>>guess the order in the list probably also matters, the first one is
>>preferred over the next one. The decision from the extension
>>detection approach also plays a role in breaking the tie,
>>e.g. if the extension detection method agrees on cbor with one of the
>>estimated types in the magic list, then cbor will be returned (again,
>>same thing applies to the metadata hint method).
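For reference, the self-describing tag idea above can be sketched as a plain byte-prefix check. This is only an illustration, not Tika code: the class and method names are hypothetical, and the three bytes are the serialization of tag 55799 given in RFC 7049, section 2.4.5.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; this class is not part of Tika or Nutch. */
final class CborSelfDescribe {
    // Tag 55799 is written as 0xD9 (major type 6 = tag, followed by a
    // 2-byte tag number) and then 0xD9 0xF7, because 55799 == 0xD9F7.
    private static final byte[] MAGIC = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the first bytes of a file are the self-describe tag. */
    static boolean startsWithSelfDescribeTag(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tag followed by an empty CBOR text string (0x60).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x60};
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(json));
    }
}
```

A check like this only helps if the writer actually emits the tag, which is the open question in the rest of this thread.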
>>
>>I have not taken a closer look at a cbor file that has the tag 55799,
>>but I expect to see its hex is something like 0xd9d9f7, or the tag
>>should be present in the header with a fixed sequence of bytes
>>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>>present in the file, or preferably in the header (within a reasonable
>>range of bytes), I believe it can probably be used as the magic
>>numbers for the cbor type.
>>
>>
>>There is another thing I have mentioned in the jira ticket I opened
>>yesterday against the cbor parser and detection: it is also possible
>>that cbor content can be embedded inside a plain json file, and the way
>>that a decoder can distinguish them in that file is by looking at the
>>tag 55799 again. This may rarely happen but a robust parser might be
>>able to take care of that; tika might need to consider the use of
>>FasterXML, which is used by the nutch tool, when developing the cbor
>>parser...
>>Again let me cite the same paragraph from the rfc,
>>
>>" a decoder might be able to parse both CBOR and JSON.
>> Such a decoder would need to mechanically distinguish the two
>> formats. An easy way for an encoder to help the decoder would be to
>> tag the entire CBOR item with tag 55799, the serialization of which
>> will never be found at the beginning of a JSON text."
>>
>>
>>Thanks
>>Luke
>>
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>Sent: Tuesday, April 21, 2015 9:49 PM
>>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
>>'NSF Polar CyberInfrastructure DR Students';
>>[email protected]
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hi Luke,
>>
>>Can you post the below conversation to dev@tika and summarize it there.
>>Also what happens if you tweak the definition of XHTML to not scan
>>until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
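To make the offset question concrete, here is a minimal sketch of what an offset range such as 0:6000 could mean for a string magic rule. It assumes the range bounds the positions at which the pattern may start; the class and method names are hypothetical and this is not Tika's matcher.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of an offset-window magic match; not Tika's implementation. */
final class MagicWindow {
    /** True if pattern starts at some offset in [from, to] (inclusive) within data. */
    static boolean matchesWithin(byte[] data, byte[] pattern, int from, int to) {
        // The pattern must fit entirely inside the data that was read.
        int last = Math.min(to, data.length - pattern.length);
        for (int start = Math.max(from, 0); start <= last; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] pattern = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);
        // Made-up input: 2000 filler bytes, then the pattern.
        byte[] data = new byte[2000 + pattern.length];
        Arrays.fill(data, (byte) 'x');
        System.arraycopy(pattern, 0, data, 2000, pattern.length);
        System.out.println(matchesWithin(data, pattern, 0, 1024));
        System.out.println(matchesWithin(data, pattern, 0, 6000));
    }
}
```

Under that assumption, a pattern sitting 2000 bytes into the content is missed by a 0:1024 window and found by a 0:6000 window, so narrowing the window only changes the outcome when the pattern lies beyond the new upper bound.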
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: [email protected]
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>-----Original Message-----
>>From: Luke <[email protected]>
>>Date: Wednesday, April 22, 2015 at 12:19 AM
>>To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U
>>(3980-Affiliate)" <[email protected]>, Chris Mattmann
>><[email protected]>
>>Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>,
>>"Zimdars, Paul A (3980-Affiliate)" <[email protected]>, NSF
>>Polar CyberInfrastructure DR Students
>><[email protected]>,
>>"[email protected]" <[email protected]>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi Professor,
>>>Please see attached jpg for the difference.
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:[email protected]]
>>>Sent: Tuesday, April 21, 2015 5:27 PM
>>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>[email protected]
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Hey Luke, what happens if you do:
>>>java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>>compared to:
>>>java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>>Any difference?
>>>
>>>------------------------
>>>Chris Mattmann
>>>[email protected]
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <[email protected]>
>>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>>To: 'Luke' <[email protected]>, Chris Mattmann
>>><[email protected]>, 'Giuseppe Totaro'
>>><[email protected]>, Chris Mattmann
>>><[email protected]>
>>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>,
>>>"'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>,
>>>NSF Polar CyberInfrastructure DR Students
>>><[email protected]>,
>>><[email protected]>
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>>Hi professor,
>>>>I just sent a pull request for adding the cbor extension.
>>>>The interesting thing is that tika is still identifying the file
>>>>dumped by the nutch dump tool as "application/xhtml+xml" even when
>>>>I manually change the file extension to the correct one (i.e. *.cbor).
>>>>
>>>>The reason is probably that tika is identifying "application/xhtml+xml"
>>>>by searching for the "<html" in the file content, PFA:
>>>>xhtml+xml.jpg. Now if you take a look at the cbor file dumped by
>>>>nutch, you see that we do have that element as part of the cbor
>>>>content, because the entire crawled xhtml document seems to be
>>>>embedded in the cbor json (PFA: cbor.jpg); and also in Tika, the
>>>>magic detection seems to have higher priority over the glob
>>>>detection, thus the type is being incorrectly detected.
>>>>
>>>>Therefore, I would like to mention that adding the entry
>>>><glob pattern="*.cbor"/> is not resolving the issue as of now
>>>>without some fixed magic bytes / patterns for cbor.
>>>>I also would like to add that things will be different with our
>>>>probabilistic mime detection selector, because if we know that the
>>>>file extension is more reliable than magic bytes, then we can
>>>>certainly add more preferential weight to the extension... this also
>>>>might show the current implementation with MimeTypes detection is a
>>>>bit stiff or less flexible in this scenario. :)
>>>>
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>-----Original Message-----
>>>>From: Luke [mailto:[email protected]]
>>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>>'[email protected]'
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>Yes, let me add the cbor extension entry in tika xml, will send the
>>>>pull request soon.
>>>>
>>>>Thanks
>>>>Luke
>>>>-----Original Message-----
>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>>[email protected]
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER
>>>>and tag along with adding an -extension command would be fantastic.
>>>>Can you file both of those NUTCH issues, wait a day or so, and then
>>>>based on feedback use your new Nutch commit karma to get those into
>>>>Nutch?
>>>>
>>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>At that point, when those two to be defined NUTCH issues are up,
>>>>Luke, in parallel can you throw up a pull request/patch in Tika for
>>>>the extension along with the MIME detection?
>>>>
>>>>Cheers,
>>>>Chris
>>>>
>>>>------------------------
>>>>Chris Mattmann
>>>>[email protected]
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Giuseppe Totaro <[email protected]>
>>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>To: Chris Mattmann <[email protected]>
>>>>Cc: Luke <[email protected]>, Chris Mattmann
>>>><[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>><[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>><[email protected]>, NSF Polar CyberInfrastructure DR
>>>>Students <[email protected]>,
>>>>"[email protected]"
>>>><[email protected]>
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks Luke. Great work.
>>>>>Chris, we wrap a single string value, representing the JSON text,
>>>>>for each file into CBOR (by using serializeCBORData method). For
>>>>>instance, using the Unix hex dump tool, we can see that, as
>>>>>expected, the first byte of all files is "0x7A" (the first three
>>>>>bits are "011", that is the major type for strings, and the
>>>>>following 5 bits are "11010", meaning a uint32_t encodes the length
>>>>>of following text), and the following 4 bytes (a 32-bit unsigned
>>>>>integer) encode the right length of file (as described in RFC7049
>>>>><http://tools.ietf.org/html/rfc7049>).
>>>>>Therefore, a CBOR tag is currently included into the file (a list
>>>>>of cbor tags is available here
>>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>I did not know about CBOR "magic header". Thanks a lot Luke for
>>>>>this great research. Chris, if you agree, I can add support for
>>>>>prepending self-describing CBOR tag 55799 to the
>>>>>CommonCrawlDataDumper class. I believe it is very easy because I
>>>>>have to enable the WRITE_TYPE_HEADER feature for CBORGenerator
>>>>>class (the source code is available here
>>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>Then, I can comment on the TIKA-1610
>>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>>
>>>>>Regarding the file extension, in the Memex CCA format the original
>>>>>file extension is used. We could add support for a -extension
>>>>>command-line option allowing the user to give a file extension
>>>>>(e.g., cbor) for all files dumped out.
>>>>>
>>>>>Thanks a lot,
>>>>>Giuseppe
>>>>>
>>>>>
>>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>>><[email protected]> wrote:
>>>>>
>>>>>Thanks for this great research, Luke!
>>>>>
>>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <[email protected]>
>>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U
>>>>>(3980-Affiliate)" <[email protected]>, Chris Mattmann
>>>>><[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>><[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>><[email protected]>, NSF Polar CyberInfrastructure DR
>>>>>Students <[email protected]>,
>>>>>"[email protected]"
>>>>><[email protected]>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks professor.
>>>>>>Hi professor and all.
>>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>
>>>>>>I tried to conduct a bit of research on this cbor detection.
>>>>>>
>>>>>>It looks like there is a self-describing tag that needs to be
>>>>>>written in the cbor file through which other applications might be
>>>>>>able to identify the cbor type....
>>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>
>>>>>>I don’t see that tag being present in the cbor file dumped by the
>>>>>>nutch tool, I am not very sure though.
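One way to check what a dumped file starts with is to decode its first byte the way RFC 7049 lays it out: the top three bits give the major type and the low five bits give the additional information. The sketch below is illustrative only, with hypothetical class and method names. It decodes the bit pattern 011 11010 mentioned in this thread (the byte 0x7A) and the first byte of the self-describing tag (0xD9).

```java
/** Hypothetical illustration of reading a CBOR initial byte; not Tika or Nutch code. */
final class CborInitialByte {
    private static final String[] MAJOR_TYPES = {
        "unsigned integer", "negative integer", "byte string", "text string",
        "array", "map", "tag", "simple value or float"
    };

    /** The major type is stored in the top three bits. */
    static int majorType(int firstByte) {
        return (firstByte & 0xFF) >>> 5;
    }

    /** The additional information is stored in the low five bits. */
    static int additionalInfo(int firstByte) {
        return firstByte & 0x1F;
    }

    static String describe(int firstByte) {
        return MAJOR_TYPES[majorType(firstByte)]
            + ", additional info " + additionalInfo(firstByte);
    }

    public static void main(String[] args) {
        // 0x7A: text string whose length follows in the next 4 bytes.
        System.out.println(describe(0x7A));
        // 0xD9: tag whose number follows in the next 2 bytes.
        System.out.println(describe(0xD9));
    }
}
```

A file whose first byte decodes to major type 3 is a bare text string, so it carries no tag in front of it; a file written with the self-describing tag would decode to major type 6 instead.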
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar
>>>>>>CyberInfrastructure DR Students'; [email protected]
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>Nice one, Luke. If you have a second and you can open up an issue
>>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>[email protected]
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke <[email protected]>
>>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann
>>>>>><[email protected]>, Chris Mattmann
>>>>>><[email protected]>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>><[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>><[email protected]>, NSF Polar CyberInfrastructure DR
>>>>>>Students <[email protected]>,
>>>>>><[email protected]>
>>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>>
>>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit
>>>>>>>of my confusion with the Nutch CommonCrawlDataDumper, appreciated.
>>>>>>>
>>>>>>>BTW, it looks like Tika might need to consider supporting a
>>>>>>>CBOR parser and detection.
>>>>>>>I checked the rfc, and it looks like CBOR has no magic numbers.
>>>>>>>PFA:
>>>>>>>rfc_cbor.jpg
>>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>>is not dumping the nutch segments with the .cbor extension, which
>>>>>>>seems to be helpful for type detection.
>>>>>>>
>>>>>>>To professor Mattmann,
>>>>>>>Tika does not support the detection of CBOR; although the trunk
>>>>>>>version has entries for cbor in the tika-mimetypes.xml (PFA:
>>>>>>>cbor_tika.mimetypes.xml), those entries are not properly detecting
>>>>>>>the cbor files dumped by CommonCrawlDataDumper. Also CBOR does
>>>>>>>not have magic bytes; off the top of my head the only way we can
>>>>>>>detect it is using the extension and the content byte histogram
>>>>>>>(please note, this is a locally optimal solution and
>>>>>>>data-dependent.) :)
>>>>>>>
>>>>>>>I think I am deviating a bit from the main route and discussion of
>>>>>>>this thread…. i.e. the plan for testing the “probabilistic mime
>>>>>>>detector selection” with polar data.
>>>>>>>Anyway, I plan to repackage tika by incorporating the
>>>>>>>probabilistic selection feature and replace the tika jar in nutch
>>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper
>>>>>>>and see how it goes. If you have any specific ideas and thoughts
>>>>>>>on the testing, please kindly let me know.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>To: Luke liu
>>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF
>>>>>>>Polar CyberInfrastructure DR Students; [email protected]
>>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>Hi Luke,
>>>>>>>
>>>>>>>
>>>>>>>my name is Giuseppe and I am a PhD student working under the
>>>>>>>supervision of Prof. Chris Mattmann. I worked on the
>>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>>couple of your observations. My comments inline below.
>>>>>>>
>>>>>>>
>>>>>>>On 19 Apr 2015, at 12:11, Luke liu
>>>>>>><[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending
>>>>>>>some time in understanding the code repo i.e.
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is
>>>>>>>dumping the crawl segments to json files with human-readable
>>>>>>>and understandable content.
>>>>>>>1) I am trying to run one of the commands on my side as shown in
>>>>>>>gen-common-crawl.sh, but the generated files all end with .html
>>>>>>>or htm. The command listed in gen-common-crawl.sh seems to
>>>>>>>allude to where the data is located on our
>>>>>>>nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>;
>>>>>>>although the locations are not exactly correct (probably they
>>>>>>>need to be updated), part of the patterns allowed me to locate
>>>>>>>some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015).
>>>>>>>Again I am seeing the dumped files all ending with html, but
>>>>>>>surprisingly inside those outputted html files, the contents are
>>>>>>>present in json format;
>>>>>>>
>>>>>>>
>>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>>More in detail, using the -epochFilename command-line option (as
>>>>>>>in gen-common-crawl.sh), the scraped data will be stored with a
>>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>,
>>>>>>>where <filetype> is either the extension of the original file or
>>>>>>>.html as default if the original file does not have an extension.
>>>>>>>This schema is used for file naming and it does not depend on
>>>>>>>internal output format (JSON).
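The naming rule described above can be sketched as follows. This is a hedged illustration of the stated rule only, not the CommonCrawlDataDumper source; the class name, method name and the way the extension is extracted are assumptions.

```java
/** Hypothetical sketch of the <epochtime>.<filetype> naming rule; not Nutch code. */
final class EpochFilename {
    /**
     * Builds "<epochMillis>.<ext>", where ext is the extension of the
     * original name, or "html" when the original name has none.
     */
    static String name(long epochMillis, String originalName) {
        int dot = originalName.lastIndexOf('.');
        int slash = originalName.lastIndexOf('/');
        // Only count a dot that sits in the last path segment and is not trailing.
        boolean hasExtension = dot > slash && dot < originalName.length() - 1;
        String ext = hasExtension ? originalName.substring(dot + 1) : "html";
        return epochMillis + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(name(1423894754000L, "index.htm"));
        System.out.println(name(1423894754000L, "some/path/page"));
    }
}
```

Because the extension comes from the crawled resource and not from the serialization, a CBOR-encoded dump of an HTML page still ends in .html or .htm under this rule, which matches what was observed.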
>>>>>>>
>>>>>>>
>>>>>>>2) Another problem is that the root object is being set with some
>>>>>>>garbled chars in each of the outputted json files (with extension
>>>>>>>html in the end), PFA: garbled.jpg, and one of the outputted json
>>>>>>>files has also been attached as an example (PFA:
>>>>>>>1423894754000.html); the json files cannot be parsed properly by
>>>>>>>aggregate.py due to those garbled chars.
>>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>>element, which is what aggregate.py reads.
>>>>>>>
>>>>>>>
>>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>>serialization. Each non-human-readable character that you notice
>>>>>>>in front and at the end of JSON data is due to CBOR-encoding.
>>>>>>>Thus, if you need to read JSON data from a document dumped out by
>>>>>>>CommonCrawlDataDumper, you have to deserialize the CBOR-encoded
>>>>>>>data structure inside the file.
>>>>>>>
>>>>>>>
>>>>>>>I hope this short overview can help in your work. I really
>>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>>great job in detection.
>>>>>>>
>>>>>>>I am available to provide all the support I can give, so do
>>>>>>>not hesitate to contact me if you need any further information.
>>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>
>>>>>>>Giuseppe
>>>>>>>
>>>>>>>
>>>>>>>Finally, after some research, I guess that the statistical
>>>>>>>information (present in the readme of the code repo) is not being
>>>>>>>collected and computed by aggregate.py from those output json
>>>>>>>files but it looks like it is coming from the log.... see the
>>>>>>>following as an example:
>>>>>>>
>>>>>>>2015-04-19 04:55:42,078 INFO tools.CommonCrawlDataDumper -
>>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>>TOTAL Stats:
>>>>>>>[
>>>>>>> {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>> {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>> {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>> {"mimeType":"application/octet-stream","count":"641"}
>>>>>>> {"mimeType":"application/epub+zip","count":"1"}
>>>>>>> {"mimeType":"application/zip","count":"6"}
>>>>>>> {"mimeType":"application/xml","count":"11"}
>>>>>>> {"mimeType":"image/png","count":"110"}
>>>>>>> {"mimeType":"image/jpeg","count":"70"}
>>>>>>> {"mimeType":"application/atom+xml","count":"213"}
>>>>>>> {"mimeType":"application/rss+xml","count":"43"}
>>>>>>> {"mimeType":"video/mp4","count":"3"}
>>>>>>> {"mimeType":"text/plain","count":"104"}
>>>>>>> {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>> {"mimeType":"image/gif","count":"2"}
>>>>>>> {"mimeType":"text/x-php","count":"1"}
>>>>>>> {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>> {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>> {"mimeType":"text/html","count":"9506"}
>>>>>>> {"mimeType":"application/pdf","count":"280"}
>>>>>>>]
>>>>>>>
>>>>>>>It turns out that aggregate.py is not the one that produces the
>>>>>>>statistical information, not sure what it does... but anyway, I
>>>>>>>think I understand the whole idea and I do concur with it; maybe
>>>>>>>we can repackage tika by incorporating the feature (i.e.
>>>>>>>probabilistic mime selection) in it and see if it can output the
>>>>>>>same information in the log as the one without it.
>>>>>>>
>>>>>>>BTW, regarding the use of the feature with probabilistic mime
>>>>>>>selection:
>>>>>>>in my pull request, I added a simple test case which might tell a
>>>>>>>bit more about how the feature is called and used, it is simple
>>>>>>>though.
>>>>>>>Here is an example snippet
>>>>>>>    ProbabilisticMimeDetectionSelector probSel = new ProbabilisticMimeDetectionSelector();
>>>>>>>    // input is an InputStream, metadata is a Metadata
>>>>>>>    MediaType type = probSel.detect(input, metadata);
>>>>>>>It is similar to MimeTypes::detect(...) (more
>>>>>>>information on this can be found in
>>>>>>>https://issues.apache.org/jira/browse/TIKA-1517)
>>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>Tika().detect() is being called by commoncrawldump), we need to
>>>>>>>modify/add some code in the TikaConfig which initializes a list
>>>>>>>of default detectors, and we need to get rid of the detector
>>>>>>>mimeTypes::MimeTypes in the list and replace it with
>>>>>>>probSel::ProbabilisticMimeDetectionSelector. (not sure if I
>>>>>>>should create another pull request with this change for
>>>>>>>TikaConfig)
>>>>>>>
>>>>>>>I think that is all of my initial thoughts with some findings and
>>>>>>>a plan; if you have anything you would like to add or
>>>>>>>comment on, please do kindly let me know, then I will start working
>>>>>>>on my 'finale'. BTW, don’t worry, even after I graduate, the
>>>>>>>graduation is not my termination with tika and this project;
>>>>>>>after that I still can and want to help this polar project and
>>>>>>>tika as much as possible, correct the programming faults and
>>>>>>>bugs, respond to the tika issues, etc.
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students;
>>>>>>>[email protected]
>>>>>>>Subject: Re: this week action from luke
>>>>>>>Importance: High
>>>>>>>
>>>>>>>Awesome Luke. I am going to work specifically on now benchmarking
>>>>>>>your code in real situations. For example, it would be fantastic
>>>>>>>to now run your Bayesian MIME detector over the whole NSF TREC
>>>>>>>Dynamic Domain data for Polar described here:
>>>>>>>
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and
>>>>>>>Annie can explain it, also CC’ed.
>>>>>>>
>>>>>>>Can we make that your goal for the next 2 weeks to actually test
>>>>>>>it and produce a real result over the whole TREC-DD data for
>>>>>>>Polar? My goal will be to get your code committed and integrated
>>>>>>>into Tika.
>>>>>>>The more you can write me a guide of how to build and test your
>>>>>>>code with Tika so I can get it committed the better.
>>>>>>>
>>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is
>>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s
>>>>>>>existing MIME detection approach. If folks have any Memex needs
>>>>>>>to try and test more accurate file identification with Tika, Luke
>>>>>>>is the guy to talk to and I have him for 2 more weeks.
>>>>>>>
>>>>>>>Thanks!
>>>>>>>
>>>>>>>Cheers,
>>>>>>>Chris
>>>>>>>
>>>>>>>------------------------
>>>>>>>Chris Mattmann
>>>>>>>[email protected]
>>>>>>>
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Luke liu <[email protected]>
>>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>To: Chris Mattmann <[email protected]>, Chris Mattmann
>>>>>>><[email protected]>
>>>>>>>Cc: 'Luke' <[email protected]>
>>>>>>>Subject: this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>Hi Professor Mattmann,
>>>>>>>
>>>>>>>I think I am in the final phase of the research, and last week I
>>>>>>>finished the last item in the list, and hopefully everything will
>>>>>>>be fine.
>>>>>>>
>>>>>>>For now, I can probably spend some time verifying or
>>>>>>>optimizing the code; the majority of the research has been
>>>>>>>done… and it would also be great if you could please comment on my
>>>>>>>work (the 2 pull requests) when you have time.
>>>>>>>
>>>>>>>If you do have confusion with any of my work, please also do let
>>>>>>>me know.
>>>>>>>
>>>>>>>Thanks and I am glad to be working with you; for the next couple of
>>>>>>>weeks before graduation, I am going to continue revising and
>>>>>>>testing the code and features to get rid of some flaws (if any)
>>>>>>>when I have time.
>>>>>>>
>>>>>>>Not sure if I missed something, and if I did miss something
>>>>>>>important, please do let me know too.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
