Chris and all,


I think we're good to go for rc2.  I shouldn't have been so optimistic with my
time estimate; I'm sorry it took so long.



I just finished the preliminary analyses.



Many thanks to Nick for figuring out that the new eval code was hotter off the 
press than it should have been. :)



Details on runs against govdocs1



10 "fixed" exceptions

4 "new" exceptions



More attachments in a handful of .doc files.

Metadata values roughly equivalent.



Changes in mime detection for "main" files:
text/plain; charset=ISO-8859-1->application/x-bibtex-text-file    11
text/dif+xml->application/dif+xml                                 9
text/plain; charset=windows-1252->application/pdf                 4
text/plain; charset=windows-1252->application/x-bibtex-text-file  3
text/plain; charset=windows-1255->application/pdf                 3
text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1     2
text/html; charset=ISO-8859-1->application/x-bibtex-text-file     1




Changes in mime detection for embedded files:
CONCAT(DETECTED_CONTENT_TYPE_A, '->', DETECTED_CONTENT_TYPE_B)  CNT
application/x-msmetafile->application/pdf                       42
image/x-pict->application/pdf                                   28
application/x-msmetafile->application/zlib                      5
application/x-emf->application/zlib                             2
application/octet-stream->application/zlib                      1




Nick has already opened an issue to look into the extra wrapping around pdfs.



It looks like the change of the magic range for pdfs was a good move (for 
govdocs1, at least).  However, we’re now losing content from those files that 
are now identified as bibtex.
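
For reference, tallies like the ones above can be produced from (old type, new type) pairs along these lines. This is only a sketch in Python, not the query that was actually run; the function name and the sample rows are made up for illustration.

```python
from collections import Counter

def tally_type_changes(pairs):
    """Count 'old->new' transitions, skipping files whose type did not change."""
    counts = Counter()
    for type_a, type_b in pairs:
        if type_a != type_b:
            counts[f"{type_a}->{type_b}"] += 1
    # most_common() sorts by descending count, like the tables above
    return counts.most_common()

# Illustrative input only, not real govdocs1 rows
sample_pairs = [
    ("application/x-msmetafile", "application/pdf"),
    ("application/x-msmetafile", "application/pdf"),
    ("image/x-pict", "application/pdf"),
    ("image/png", "image/png"),  # unchanged, so it is not counted
]

for change, cnt in tally_type_changes(sample_pairs):
    print(change, cnt)
```

Unchanged types are dropped on purpose so that only the differences between the two runs show up.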



ONE BIG AREA FOR FURTHER ANALYSIS

In the earlier eval code, we had one row per input/parent/container document.  
I’ve modified it so that we now have one row per document, whether it is an 
attachment/embedded document or a parent/container document.  I’m also now 
storing the stacktraces from the embedded documents.
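
To make the new row layout concrete, here is a minimal Python sketch of the one-row-per-document idea. It is not the actual eval code: the function name and the dict keys (content_type, stacktrace) are assumptions for illustration, and it assumes the container's metadata comes first in each list.

```python
def rows_for_file(file_path, metadata_list):
    """Yield one row per document: the container first, then each embedded doc.

    metadata_list is assumed to hold one dict per document, container first.
    The keys 'content_type' and 'stacktrace' are hypothetical names.
    """
    for index, md in enumerate(metadata_list):
        yield {
            "file_path": file_path,
            "embedded_index": index,        # 0 = parent/container document
            "is_embedded": index > 0,
            "content_type": md.get("content_type"),
            "stacktrace": md.get("stacktrace"),  # None when no exception was stored
        }

# Made-up example: a .ppt container with one embedded spreadsheet that failed
example_rows = list(rows_for_file("example.ppt", [
    {"content_type": "application/vnd.ms-powerpoint"},
    {"content_type": "application/vnd.ms-excel", "stacktrace": "placeholder stack trace"},
]))

for row in example_rows:
    print(row["embedded_index"], row["content_type"], row["stacktrace"] is not None)
```

With this shape, container and embedded exception counts come from the same table, filtered on is_embedded.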



For govdocs1, we’re now at 6,653 “caught” exceptions for container documents 
(out of 979,143 = 0.7%), but we have roughly 33k exceptions for embedded 
documents (out of 1,364,552 = 2.4%).  As before, I need to confirm that 
something didn’t go wrong with my code; it could also be the case that the 
files are being mis-id’d as Excel… For now, though, it looks like that high # 
is driven by embedded Excel files.
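
As a quick sanity check on the two percentages, a throwaway Python snippet. Note that 33,000 is a stand-in for the "roughly 33k" figure, so the embedded rate is approximate; the function name is just for illustration.

```python
def exception_rate(exceptions, total):
    """Return the exception rate as a percentage of all documents."""
    return 100.0 * exceptions / total

container_rate = exception_rate(6_653, 979_143)
# Assumption: 33_000 approximates the "roughly 33k" embedded exceptions
embedded_rate = exception_rate(33_000, 1_364_552)

print(f"container: {container_rate:.1f}%")
print(f"embedded:  {embedded_rate:.1f}%")
```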



Compare exceptions for container files:
DETECTED_CONTENT_TYPE_B                                                    CNT
application/xml                                                            1781
application/vnd.ms-powerpoint                                              968
application/msword                                                         681
application/vnd.ms-excel                                                   272
application/pdf                                                            107
application/vnd.google-earth.kml+xml                                       19
image/jpeg                                                                 5
application/x-tika-msoffice                                                5
text/plain; charset=ISO-8859-1                                             5
application/vnd.ms-excel.sheet.3                                           4
application/vnd.openxmlformats-officedocument.presentationml.presentation  3
application/vnd.ms-excel.sheet.4                                           3
application/xhtml+xml; charset=UTF-8                                       2
application/rdf+xml                                                        2
image/vnd.dwg                                                              2
application/rtf                                                            2
application/rss+xml                                                        1
application/dita+xml; format=topic                                         1
application/vnd.openxmlformats-officedocument.wordprocessingml.document    1
text/html; charset=windows-1252                                            1
text/html; charset=ISO-8859-1                                              1




With exceptions for embedded files:
DETECTED_CONTENT_TYPE_B               CNT
application/vnd.ms-excel              26008
image/png                             2184
application/vnd.visio                 2019
image/x-ms-bmp                        1899
image/jpeg                            311
image/vnd.dwg                         51
application/x-font-ttf                40
image/vnd.adobe.photoshop             21
application/x-tika-msoffice           14
application/vnd.google-earth.kml+xml  13
application/vnd.ms-powerpoint         12
null                                  9
application/xml                       3
application/msword                    2
application/pdf                       1






Top 10 file types overall for parent/container files
DETECTED_CONTENT_TYPE_B           CNT
application/pdf                   230860
image/jpeg                        109282
text/html; charset=ISO-8859-1     86344
application/msword                76984
text/plain; charset=ISO-8859-1    64186
application/vnd.ms-excel          58930
application/vnd.ms-powerpoint     51169
text/html; charset=windows-1252   50292
text/plain; charset=windows-1252  42988
text/html; charset=UTF-8          41003




Top 10 file types for embedded files
DETECTED_CONTENT_TYPE_B                                    CNT
image/png                                                  470639
image/jpeg                                                 261050
application/x-msmetafile                                   231803
application/x-emf                                          103058
application/x-tika-msoffice                                82964
application/vnd.ms-excel                                   54296
image/x-pict                                               29764
application/msword                                         22699
image/x-ms-bmp                                             21525
application/x-tika-msoffice-embedded; format=ole10_native  19751






For anyone who wants to kick the tires on the apparent embedded Excel issue, 
here are some files, sorted in ascending order of JSON length:


428/428996.ppt  428996.ppt/11   application/vnd.ms-excel  6050
920/920182.ppt  920182.ppt/2    application/vnd.ms-excel  6093
852/852522.ppt  852522.ppt/758  application/vnd.ms-excel  6110
851/851799.ppt  851799.ppt/789  application/vnd.ms-excel  6110
854/854876.ppt  854876.ppt/696  application/vnd.ms-excel  6112
703/703075.ppt  703075.ppt/830  application/vnd.ms-excel  6112
849/849126.ppt  849126.ppt/880  application/vnd.ms-excel  6114
849/849621.ppt  849621.ppt/861  application/vnd.ms-excel  6116
847/847762.ppt  847762.ppt/992  application/vnd.ms-excel  6119




Looks like the majority are embedded in ppt, but there are several embedded in 
xls as well.
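
If it helps with triage, something like the following would pull the smallest failing embedded files and count the container extensions. This is a sketch only: the row keys (file_path, content_type, json_length) are hypothetical names rather than the real column names, and the sample rows are three entries copied from the list above.

```python
from collections import Counter
import os

def smallest_by_json_length(rows, mime="application/vnd.ms-excel", limit=10):
    """Return up to `limit` rows of the given type, shortest JSON first."""
    matching = [r for r in rows if r["content_type"] == mime]
    return sorted(matching, key=lambda r: r["json_length"])[:limit]

def container_extensions(rows):
    """Count rows by the extension of the container file, e.g. '.ppt'."""
    return Counter(os.path.splitext(r["file_path"])[1].lower() for r in rows)

# Three rows from the list above, deliberately out of order
sample_rows = [
    {"file_path": "852/852522.ppt", "content_type": "application/vnd.ms-excel", "json_length": 6110},
    {"file_path": "428/428996.ppt", "content_type": "application/vnd.ms-excel", "json_length": 6050},
    {"file_path": "920/920182.ppt", "content_type": "application/vnd.ms-excel", "json_length": 6093},
]

for row in smallest_by_json_length(sample_rows):
    print(row["file_path"], row["json_length"])
print(container_extensions(sample_rows))
```

Starting from the shortest JSON should give the smallest files to reproduce the exception with.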



Cheers,



          Tim

-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, June 03, 2015 8:46 PM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?



Fixed eval code, thanks to Nick.



Now running against doc/x list fixes to confirm success.



Will rerun tomorrow on full set, with results by noon EDT.



-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, June 03, 2015 7:28 AM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?

Y.  Will do.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]]
Sent: Tuesday, June 02, 2015 10:31 PM
To: [email protected]
Subject: Re: [DISCUSS] 1.9 Tika release?

Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.

Let me know when I should spin RC #2. Ready and willing! :-)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-----Original Message-----
From: <Allison>, "Timothy B." <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, June 1, 2015 at 6:53 PM
To: "[email protected]" <[email protected]>
Subject: RE: [DISCUSS] 1.9 Tika release?

>Thank you.  Sorry about this.  I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally.  Will rerun overnight with results tomorrow.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>Sent: Monday, June 01, 2015 9:06 PM
>To: [email protected]
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>ACK nw, will be happy to spin RC #2 when we’re ready :)
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Monday, June 1, 2015 at 6:04 PM
>To: "[email protected]" <[email protected]>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>-1
>>
>>Do not release because there are roughly 200 new exceptions in govdocs1
>>caused by the list processing code I just added to doc and docx files.
>>
>>Argh...
>>
>>Will fix asap.
>>
>>-----Original Message-----
>>From: Allison, Timothy B. [mailto:[email protected]]
>>Sent: Monday, June 01, 2015 7:03 AM
>>To: [email protected]
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>Will run rc against govdocs1 and commoncrawl to see what I find.  Results
>>by tomorrow.
>>
>>Thank you, Chris!
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>Sent: Sunday, May 31, 2015 2:52 PM
>>To: [email protected]
>>Subject: [DISCUSS] 1.9 Tika release?
>>
>>Hey Folks,
>>
>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>0.9 RC right now too and am in the mood).
>>
>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>1 RC if I missed anything of course.
>>
>>Cheers,
>>Chris
>>