Chris and all,
I think we're good to go for rc2. I shouldn't have been so optimistic on the time estimate; I'm sorry it took so long. I just finished the preliminary analyses. Many thanks to Nick for figuring out that the new eval code was hotter off the press than it should have been. :)

Details on runs against govdocs1:

10 "fixed" exceptions
4 "new" exceptions
More attachments in a handful of .doc files.
Metadata values roughly equivalent.

Changes in mime detection for "main" files:

text/plain; charset=ISO-8859-1->application/x-bibtex-text-file    11
text/dif+xml->application/dif+xml                                 9
text/plain; charset=windows-1252->application/pdf                 4
text/plain; charset=windows-1252->application/x-bibtex-text-file  3
text/plain; charset=windows-1255->application/pdf                 3
text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1     2
text/html; charset=ISO-8859-1->application/x-bibtex-text-file     1

Changes in mime detection for embedded files:

CONCAT(DETECTED_CONTENT_TYPE_A, '->', DETECTED_CONTENT_TYPE_B)  CNT
application/x-msmetafile->application/pdf                       42
image/x-pict->application/pdf                                   28
application/x-msmetafile->application/zlib                      5
application/x-emf->application/zlib                             2
application/octet-stream->application/zlib                      1

Nick has already opened an issue to look into the extra wrapping around pdfs. It looks like the change of the magic range for pdfs was a good move (for govdocs1, at least). However, we’re now losing content from those files that are now identified as bibtex.

ONE BIG AREA FOR FURTHER ANALYSIS

In the earlier eval code, we had one row per input/parent/container document. I’ve modified it so that we now have one row per document, whether it is an attachment/embedded document or a parent/container document. I’m also now storing the stacktraces from the embedded documents.

For govdocs1, we’re now at 6,653 “caught” exceptions for container documents (out of 979,143 = 0.7%), but we have roughly 33k exceptions for embedded documents (out of 1,364,552 = 2.4%). As before, I need to confirm that something didn’t go wrong with my code; it could also be the case that the files are being mis-id’d as Excel… For now, though, it looks like that high # is driven by embedded Excel files.

Compare exceptions for container files:

DETECTED_CONTENT_TYPE_B                                                    CNT
application/xml                                                            1781
application/vnd.ms-powerpoint                                              968
application/msword                                                         681
application/vnd.ms-excel                                                   272
application/pdf                                                            107
application/vnd.google-earth.kml+xml                                       19
image/jpeg                                                                 5
application/x-tika-msoffice                                                5
text/plain; charset=ISO-8859-1                                             5
application/vnd.ms-excel.sheet.3                                           4
application/vnd.openxmlformats-officedocument.presentationml.presentation  3
application/vnd.ms-excel.sheet.4                                           3
application/xhtml+xml; charset=UTF-8                                       2
application/rdf+xml                                                        2
image/vnd.dwg                                                              2
application/rtf                                                            2
application/rss+xml                                                        1
application/dita+xml; format=topic                                         1
application/vnd.openxmlformats-officedocument.wordprocessingml.document    1
text/html; charset=windows-1252                                            1
text/html; charset=ISO-8859-1                                              1

With exceptions for embedded files:

DETECTED_CONTENT_TYPE_B               CNT
application/vnd.ms-excel              26008
image/png                             2184
application/vnd.visio                 2019
image/x-ms-bmp                        1899
image/jpeg                            311
image/vnd.dwg                         51
application/x-font-ttf                40
image/vnd.adobe.photoshop             21
application/x-tika-msoffice           14
application/vnd.google-earth.kml+xml  13
application/vnd.ms-powerpoint         12
null                                  9
application/xml                       3
application/msword                    2
application/pdf                       1

Top 10 file types overall for parent/container files:

DETECTED_CONTENT_TYPE_B           CNT
application/pdf                   230860
image/jpeg                        109282
text/html; charset=ISO-8859-1     86344
application/msword                76984
text/plain; charset=ISO-8859-1    64186
application/vnd.ms-excel          58930
application/vnd.ms-powerpoint     51169
text/html; charset=windows-1252   50292
text/plain; charset=windows-1252  42988
text/html; charset=UTF-8          41003

Top 10 file types for embedded files:

DETECTED_CONTENT_TYPE_B                                    CNT
image/png                                                  470639
image/jpeg                                                 261050
application/x-msmetafile                                   231803
application/x-emf                                          103058
application/x-tika-msoffice                                82964
application/vnd.ms-excel                                   54296
image/x-pict                                               29764
application/msword                                         22699
image/x-ms-bmp                                             21525
application/x-tika-msoffice-embedded; format=ole10_native  19751

For anyone who wants to kick the tires on the apparent embedded Excel issue, here are some files, sorted in ascending order of the json length:

428/428996.ppt  428996.ppt/11   application/vnd.ms-excel  6050
920/920182.ppt  920182.ppt/2    application/vnd.ms-excel  6093
852/852522.ppt  852522.ppt/758  application/vnd.ms-excel  6110
851/851799.ppt  851799.ppt/789  application/vnd.ms-excel  6110
854/854876.ppt  854876.ppt/696  application/vnd.ms-excel  6112
703/703075.ppt  703075.ppt/830  application/vnd.ms-excel  6112
849/849126.ppt  849126.ppt/880  application/vnd.ms-excel  6114
849/849621.ppt  849621.ppt/861  application/vnd.ms-excel  6116
847/847762.ppt  847762.ppt/992  application/vnd.ms-excel  6119

Looks like the majority are embedded in ppt, but there are several embedded in xls as well.

Cheers,

Tim

-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, June 03, 2015 8:46 PM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?

Fixed eval code, thanks to Nick.
Now running against doc/x list fixes to confirm success. Will rerun tomorrow on full set, with results by noon ETD.

-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, June 03, 2015 7:28 AM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?

Y. Will do.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]]
Sent: Tuesday, June 02, 2015 10:31 PM
To: [email protected]
Subject: Re: [DISCUSS] 1.9 Tika release?

Thanks Tim. Looks like you and Nick have been fixing some other stuff too.
Let me know when I should spin RC #2. Ready and willing! :-)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, June 1, 2015 at 6:53 PM
To: "[email protected]" <[email protected]>
Subject: RE: [DISCUSS] 1.9 Tika release?

>Thank you. Sorry about this. I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally. Will rerun over night with results tomorrow.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>Sent: Monday, June 01, 2015 9:06 PM
>To: [email protected]
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>ACK nw, will be happy to spin RC #2 when we’re ready :)
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Monday, June 1, 2015 at 6:04 PM
>To: "[email protected]" <[email protected]>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>-1
>>
>>Do not release because there are roughly 200 new exceptions in govdocs1
>>caused by the list processing code I just added to doc and docx files.
>>
>>Argh...
>>
>>Will fix asap.
>>
>>-----Original Message-----
>>From: Allison, Timothy B. [mailto:[email protected]]
>>Sent: Monday, June 01, 2015 7:03 AM
>>To: [email protected]
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>>by tomorrow.
>>
>>Thank you, Chris!
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>Sent: Sunday, May 31, 2015 2:52 PM
>>To: [email protected]
>>Subject: [DISCUSS] 1.9 Tika release?
>>
>>Hey Folks,
>>
>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>0.9 RC right now too and am in the mood).
>>
>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>1 RC if I missed anything of course.
>>
>>Cheers,
>>Chris
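
For anyone re-checking the govdocs1 exception rates quoted in the top message of this thread, the arithmetic can be sketched as below. This is only a minimal sketch: the counts are copied from that message, while the variable and function names are made up for illustration and are not part of the eval code.

```python
# Figures copied from the top message of the thread; all names here are
# hypothetical and chosen only for this illustration.
CONTAINER_EXCEPTIONS = 6_653
CONTAINER_DOCS = 979_143
EMBEDDED_DOCS = 1_364_552

# Exception counts per detected type for embedded files, as listed in the message.
EMBEDDED_EXCEPTIONS_BY_TYPE = {
    "application/vnd.ms-excel": 26008,
    "image/png": 2184,
    "application/vnd.visio": 2019,
    "image/x-ms-bmp": 1899,
    "image/jpeg": 311,
    "image/vnd.dwg": 51,
    "application/x-font-ttf": 40,
    "image/vnd.adobe.photoshop": 21,
    "application/x-tika-msoffice": 14,
    "application/vnd.google-earth.kml+xml": 13,
    "application/vnd.ms-powerpoint": 12,
    "null": 9,
    "application/xml": 3,
    "application/msword": 2,
    "application/pdf": 1,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Total embedded exceptions is the sum of the per-type counts.
embedded_total = sum(EMBEDDED_EXCEPTIONS_BY_TYPE.values())

container_rate = percent(CONTAINER_EXCEPTIONS, CONTAINER_DOCS)
embedded_rate = percent(embedded_total, EMBEDDED_DOCS)
excel_share = percent(
    EMBEDDED_EXCEPTIONS_BY_TYPE["application/vnd.ms-excel"], embedded_total
)

print(f"container exception rate: {container_rate}%")
print(f"embedded exception rate:  {embedded_rate}% ({embedded_total} exceptions)")
print(f"Excel share of embedded exceptions: {excel_share}%")
```

Summing the per-type embedded counts gives the "roughly 33k" figure, and the Excel share is what supports the observation that the high embedded number is driven by embedded Excel files.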
