Yes, my suspicion holds up.  If you look at TOP_10_UNIQUE_TOKEN_DIFFS_A
in content_diffs_with_exceptions.xlsx, for the vast majority of ppt
files there aren't any unique words that we were extracting with 4.0.0
and are no longer extracting with 4.0.1.  Note, too, that while the
number of tokens differs, the number of unique tokens does not...for
the majority of ppt.

It looks like we have lost some content in docx template files, e.g.:
http://162.242.228.174/docs/commoncrawl2/KQ/KQQ5VZ6BBBRCZPY4GDUIEMVPSGABOMM4

We used to get 17 unique words from this, and we now get just
1...we've lost: de: 2 | la: 2 | 03: 1 | 06: 1 | 1: 1 | 16: 1 | 2009: 1
| 3: 1 | conciencia: 1 | despertar: 1

These were in the header...I have to step away from the keyboard for
now...any ideas?
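
For anyone who wants to repeat this kind of check by hand on a single
file, here is a rough sketch of the comparison described above: total
vs. unique token counts, and the tokens that were in the old extract
but are missing from the new one.  The regex tokenizer is only a crude
stand-in (an assumption, not what the reports use), and the function
names and sample strings are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude stand-in tokenizer (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def token_stats(text):
    # Return (total token count, unique token count) for one extract.
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))


def lost_tokens(old_text, new_text):
    # Tokens present in the old extract but absent from the new one,
    # with the number of times each occurred in the old extract.
    old = Counter(tokenize(old_text))
    new = set(tokenize(new_text))
    return {tok: n for tok, n in old.items() if tok not in new}


if __name__ == "__main__":
    # Hypothetical extracts: duplicated content changes the total count
    # but not the unique count.
    print(token_stats("notes notes slide"))
    print(token_stats("notes slide"))
    # Hypothetical extracts: header text present in the old output only.
    print(lost_tokens("Despertar de la conciencia de la 2009 body", "body"))
```

If the unique count is unchanged and lost_tokens() comes back empty,
the difference is just duplicated text going away; a non-empty result
is the docx-template kind of loss.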

On Wed, Nov 21, 2018 at 12:37 PM Tim Allison <talli...@apache.org> wrote:
>
> Reports are available here:
> http://162.242.228.174/reports/reports_poi_4_0_1-rc1.tgz
>
> We have a bunch less content in ppt, but I _think_ this is because at
> the Tika level we used to duplicate notes content, and we've fixed
> that bug.  So, I think this is an improvement, but I need to check.
> On Wed, Nov 21, 2018 at 12:05 PM Andreas Beeker <kiwiwi...@apache.org> wrote:
> >
> > On 21.11.18 10:47, pj.fanning wrote:
> > > I found a few missing classes in poi-ooxml-schemas.jar.
> >
> > Is this now a "-1", i.e. a must-have, since otherwise we get a lot of
> > Stack Overflow messages complaining about it,
> >
> > ... or a "0-", i.e. a nice-to-have, where until 4.0.2 is out the users
> > can use the full-schema?
> >
> >
> > I'm asking about this, as there have already been a few changes to the
> > trunk since I provided the RC and we might have to do another Tika- /
> > POI- common crawl run... which I would like to avoid.
> >
> > Andi
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> > For additional commands, e-mail: dev-h...@poi.apache.org
> >

