RE: Tika 1.15

Allison, Timothy B. Tue, 02 May 2017 05:05:31 -0700

The other two critical files:

Content/common_token_comparisons_by_mime.xlsx
Content/content_diffs_ignore_exceptions.xlsx



Oh, and the key part, which is less than ideal, is that there has to be a human 
in the loop...which makes the need for visualizations even more critical.

For example:

1) We now have more exceptions in file type y.  Well, that's ok because we 
didn't have a parser for file type y before.  

2) We have fewer exceptions in file type x; that should be good, right?  Well, 
no, because now there are far fewer "common words" in x, which means that the 
parser became less restrictive and sloppier.  We now have more noise.

3) We now have more "common words" in file type x; that should be a sign of 
improvement, right?  Not necessarily, because:
        a) we failed to remove a few common HTML markup terms from the 
"common words" list, and our HTML parser/detection is failing, so we count a 
bunch more "span" and "body" words.  That's bad.  (We can fix this as we go 
forward)
        b) our parsers are repeating sections now.  Doh! (We can fix this with 
better statistics).
        c) our OCR is hallucinating common words because we're using a heavily 
dictionary-biased OCR system.  (unlikely, but possible)

The lists go on...

In short, my original vision of nightly automated tests has had a run-in with 
reality and lost.  A human has to make sense of the output/db.
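
The three cases above could still feed a first-pass triage that narrows down 
what the human has to look at. Below is a minimal Python sketch of that idea. 
Everything in it is an assumption for illustration: the MimeStats fields, the 
10% tolerance, and the flag wording are hypothetical and are not tika-eval's 
actual schema or logic. It only raises flags; it does not decide whether a 
change is good or bad.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MimeStats:
    """Per-mime-type totals for one run (hypothetical fields, not tika-eval's schema)."""
    files: int
    exceptions: int
    common_tokens: int


def triage(a: Optional[MimeStats], b: MimeStats, tol: float = 0.10) -> List[str]:
    """Return review flags for one mime type; a is the old run, b is the new run."""
    if a is None or a.files == 0:
        # Case 1: no parser for this type before, so new exceptions are expected.
        return ["new type: exceptions expected, nothing to compare against"]
    flags = []
    exc_delta = b.exceptions - a.exceptions
    # Relative change in common tokens; tol is an arbitrary illustrative threshold.
    tok_change = (b.common_tokens - a.common_tokens) / max(a.common_tokens, 1)
    if exc_delta > 0:
        flags.append("more exceptions: possible regression")
    if exc_delta < 0 and tok_change < -tol:
        # Case 2: fewer exceptions, but also far fewer common words.
        flags.append("fewer exceptions but fewer common words: parser may be emitting noise")
    if tok_change > tol:
        # Case 3: more common words is not automatically an improvement.
        flags.append("more common words: check for markup terms, repeated sections, OCR bias")
    return flags


def triage_all(run_a: Dict[str, MimeStats], run_b: Dict[str, MimeStats]) -> Dict[str, List[str]]:
    """Apply triage() to every mime type seen in the new run."""
    return {mime: triage(run_a.get(mime), stats) for mime, stats in run_b.items()}


if __name__ == "__main__":
    old = {"x": MimeStats(files=100, exceptions=10, common_tokens=5000)}
    new = {
        "x": MimeStats(files=100, exceptions=4, common_tokens=3000),
        "y": MimeStats(files=20, exceptions=5, common_tokens=800),
    }
    for mime, flags in triage_all(old, new).items():
        print(mime, flags)
```

Every flagged type still needs a person to open the underlying files; the 
sketch only sorts the pile.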

Dumping some reports to xlsx yields good data for the developer who wrote 
the code, but, I agree, the reports are largely incomprehensible to someone 
getting started.

So, please, help!



-----Original Message-----
From: Tyler Bui-Palsulich [mailto:[email protected]] 
Sent: Monday, May 1, 2017 11:39 PM
To: [email protected]
Subject: RE: Tika 1.15

How exactly did you "evaluate" the results? I opened the zip and looked at a 
few of the sheets, but it's a bit daunting.

Any way we could dump JSON? That's a bit easier to build visualizations for.

Tyler

On May 1, 2017 3:59 PM, "Allison, Timothy B." <[email protected]> wrote:

> Sounds good.  W00t!
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[email protected]]
> Sent: Monday, May 1, 2017 4:57 PM
> To: [email protected]
> Subject: Re: Tika 1.15
>
> Thanks Tim. I am going to try and get tika-dl added (if possible), and 
> also try the Sentiment Parser next. If I can get one or both of those 
> (in the next day or so), then I will give you the heads up to begin testing.
> Video recognition is in!
>
>
>
>
>
> On 5/1/17, 12:42 PM, "Allison, Timothy B." <[email protected]> wrote:
>
>     I finally had a chance to look through the results of the first
>     regression run.
>
>     I made a few trivial changes to our parsers and to tika-eval.
>
>     We appear to have many more exceptions in files parsed by our
>     CompressorParser, but this is because of reporting...not because of
>     reality -- the exception is now coming in the container file, not an
>     attachment...and tika-eval wasn't matching A and B correctly.
>
>     There is a regression that's been fixed in PDFBox trunk
>     (PDFBOX-3717), but I don't see that as a blocker.
>
>     We have new exceptions in the new parsers, EMF, WMF, .xlsb,
>     wordperfect, but that's because we're actually parsing those now. :)
>
>     All else looks to be in decent shape.
>
>     Chris and Team and All,
>       Let me know when you're ready for me to kick off the next
>       regression run.
>
>               Cheers,
>
>                       Tim
>
>
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (3010) [mailto:[email protected]]
>     Sent: Wednesday, April 26, 2017 12:48 PM
>     To: [email protected]
>     Subject: Re: Tika 1.15
>
>     Thank you!
>
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Chris Mattmann, Ph.D.
>     Principal Data Scientist, Engineering Administrative Office (3010)
>     Manager, NSF & Open Source Projects Formulation and Development
>     Offices (8212)
>     NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-503
>     Email: [email protected]
>     WWW:  http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS)
>     Adjunct Associate Professor, Computer Science Department
>     University of Southern California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>     On 4/26/17, 9:35 AM, "Allison, Timothy B." <[email protected]> wrote:
>
>         Oh.  Ok.  Will wait, then?
>
>         -----Original Message-----
>         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>         Sent: Wednesday, April 26, 2017 11:38 AM
>         To: [email protected]
>         Subject: Re: Tika 1.15
>
>         I want to see if I can get in the VideoRecognition parser, and
>         also the Sentiment one.
>
>         I hope to get it done in the next day or so. Thanks.
>
>
>
>         On 4/26/17, 7:54 AM, "Allison, Timothy B." <[email protected]> wrote:
>
>             With the added TSD parser, I think I should rerun the
>             regression testing.  Given that, I also fixed 2099, and we'll
>             benefit from a rerun.
>
>             Anything else before I rerun the regression testing?
>
>             Any problems observed in first run?
>
