Re: 1.20?

2018-12-18 Thread Tim Allison
Reports on mp4s, junrar, msaccess and a random subset of the
regression corpus are available here:
http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz


Re: 1.20?

2018-12-14 Thread Tim Allison
Thank you, again, Luís Filipe Nassif!  There's no point in having
reports unless we pay attention to them :P.  I reverted junrar to
where it was in 1.19.1. I also reverted jackcess based on the reports.

All,
  On the theory that it isn't a great idea to push to production on a
Friday, I'm going to let the recent changes rest over the weekend.
I'll rerun some tests on a subset of the regression corpus on Monday
and then roll rc1.  If anyone wants to kick the tires on the recent
version changes, including parsers that depend on the upgraded guava,
that'd be great!

Onward!

Cheers,

   Tim


Re: 1.20?

2018-12-13 Thread Tim Allison
Let me actually take a look before answering. Sorry!


Re: 1.20?

2018-12-13 Thread Tim Allison
Thank you for reading the reports!!!

The files are very likely broken.  I can take a look.  The change was
probably because of an "upgrade" to junrar.  Should I revert to the
version we used in 1.19.1?


Re: 1.20?

2018-12-13 Thread Luís Filipe Nassif
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files
in the "likely broken" folder, but it seems Tika was able to extract some
text from them before. Do you know if those files are really broken, and
why Tika extracted text from them before?

Thank you,
Luis


Re: 1.20?

2018-12-13 Thread Chris Mattmann
Roll forward! Yay!


Re: 1.20?

2018-12-13 Thread Tim Allison
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".
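
As a rough illustration of that check (a sketch only, not the actual
report tooling): assuming the comparison report has been loaded as one
record per file pair, with the four token-count columns named above plus a
hypothetical "file" column, the duplication pattern shows up as rows where
the unique-token counts match while "num_tokens_a" is larger than
"num_tokens_b". The sample rows are made up.

```python
# Illustrative sample data; the "file" column and all values are assumptions.
# Only the four token-count column names come from the report described above.
rows = [
    {"file": "a.ppt", "num_unique_tokens_a": 120, "num_unique_tokens_b": 120,
     "num_tokens_a": 900, "num_tokens_b": 450},
    {"file": "b.ppt", "num_unique_tokens_a": 80, "num_unique_tokens_b": 80,
     "num_tokens_a": 300, "num_tokens_b": 210},
    {"file": "c.ppt", "num_unique_tokens_a": 95, "num_unique_tokens_b": 60,
     "num_tokens_a": 400, "num_tokens_b": 250},
]


def looks_like_duplication(row):
    """True when both sides share the same vocabulary size but side A
    has more tokens overall, i.e. the extra tokens are repeats."""
    same_vocabulary = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    more_tokens_in_a = row["num_tokens_a"] > row["num_tokens_b"]
    return same_vocabulary and more_tokens_in_a


# Files whose content difference is consistent with repeated text only.
flagged = [row["file"] for row in rows if looks_like_duplication(row)]
print(flagged)
```

Rows that are not flagged (such as the third sample row, where the
unique-token counts differ) would still need a closer look, since there
the vocabulary itself changed.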

I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO) inserting -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2
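
To make that reading of the column concrete, here is a small sketch that
parses the example value quoted above and checks whether every token looks
like style markup. The "token: count | token: count" layout is taken from
the example; the list of style words and the helper names are hypothetical
and hand-picked for this illustration.

```python
def parse_token_diffs(cell):
    """Split 'token: count | token: count' into a dict of token -> count."""
    counts = {}
    for part in cell.split(" | "):
        # rpartition keeps tokens that themselves contain commas, e.g. "0,0,0".
        token, _, count = part.strip().rpartition(": ")
        counts[token] = int(count)
    return counts


# The example value from the message above.
cell = ("rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | "
        "147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2")

# Hypothetical list of CSS-like words, chosen by hand for this example.
STYLE_WORDS = {"rgb", "color", "font", "background", "bold", "weight", "family"}


def is_style_token(token):
    """Treat CSS-like words and comma-separated numbers (rgb values) as style."""
    return token in STYLE_WORDS or all(p.isdigit() for p in token.split(","))


diffs = parse_token_diffs(cell)
non_style = [token for token in diffs if not is_style_token(token)]
print(non_style)
```

An empty "non_style" list for a file suggests that the only tokens lost
were style markup rather than real content.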

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.


Re: 1.20?

2018-12-10 Thread Tim Allison
Any blockers on 1.20?  I'm going to kick off the regression tests shortly.


Re: 1.20?

2018-11-30 Thread loompa
Hi,
On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:

> Dave,
>   Should I try to get the Docker plugin working again?
>

That would be great. I think I may have gone down the wrong path building
an image at package time, as there doesn't seem to be an easy way to
publish it under an Apache-labelled org on Dockerhub unless it builds from
source.

I have some time over the weekend, so I could update to where I got to and
see what you think.

Cheers,
Dave


Re: 1.20?

2018-11-28 Thread Lewis John McGibbney
+1. It would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison  wrote: 
> All,
>   POI 4.0.1 will be out shortly with some important bug fixes.  What would
> you all think of targeting 1st/2nd week of December for 1.20?
> 
>  Cheers,
>  Tim
> 


Re: 1.20?

2018-11-21 Thread Tim Allison
Dave,
  Should I try to get the Docker plugin working again?


Re: 1.20?

2018-11-20 Thread Chris Mattmann
Love it and I can align tika-python with that too ☺

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "dev@tika.apache.org"
Subject: 1.20?

All,
   POI 4.0.1 will be out shortly with some important bug fixes.  What would
you all think of targeting 1st/2nd week of December for 1.20?

 Cheers,
 Tim