Re: 1.20?
Reports on mp4s, junrar, msaccess and a random subset of the regression corpus are available here:

http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote:
> Let me actually take a look before answering. Sorry!
Re: 1.20?
Thank you, again, Luís Filipe Nassif! There's no point in having reports unless we pay attention to them :P. I reverted junrar to where it was in 1.19.1. I also reverted jackcess based on the reports.

All,

On the theory that it isn't a great idea to push to production on a Friday, I'm going to let the recent changes rest over the weekend. I'll rerun some tests on a subset of the regression corpus on Monday and then roll rc1.

If anyone wants to kick the tires on the recent version changes, including parsers that depend on the upgraded guava, that'd be great!

Onward!

Cheers,
Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote:
> Let me actually take a look before answering. Sorry!
Re: 1.20?
Let me actually take a look before answering. Sorry!

On Thu, Dec 13, 2018 at 5:30 PM Tim Allison wrote:
> Thank you for reading the reports!!!
>
> The files are very likely broken. I can take a look. The change was
> probably because of an "upgrade" to junrar. Should I revert to the
> version we used in 1.19.1?
Re: 1.20?
Thank you for reading the reports!!!

The files are very likely broken. I can take a look. The change was probably because of an "upgrade" to junrar. Should I revert to the version we used in 1.19.1?

On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif wrote:
> Hi Tim,
>
> Reading your great reports, I also saw some new exceptions with RAR files
> in the likely broken folder, but it seems Tika was able to extract some text
> from them before. Do you know if those files are really broken, and why Tika
> extracted text from them before?
>
> Thank you,
> Luis
Re: 1.20?
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files in the likely broken folder, but it seems Tika was able to extract some text from them before. Do you know if those files are really broken, and why Tika extracted text from them before?

Thank you,
Luis

On Thu, Dec 13, 2018 at 13:02, Tim Allison wrote:
> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info. To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO) inserting -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go. Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.
Re: 1.20?
Roll forward! Yay!

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Thursday, December 13, 2018 at 7:02 AM
To: "dev@tika.apache.org"
Subject: Re: 1.20?

Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency upgrades I ran.

The _major_ difference in content for ppt is explained by the duplication of header/footer info. To confirm this, note that the values for "num_unique_tokens_a" and "num_unique_tokens_b" are identical for nearly all ppt->ppt, but there are far more tokens in "num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but that's because we're now suppressing the style markup that our parser was (incorrectly, IMHO) inserting -- check the values in "top_10_unique_token_diffs_a", e.g.:

rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2

In short, I think we're good to go. Will roll rc1 later today or (more likely) tomorrow unless there are objections.
Re: 1.20?
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency upgrades I ran.

The _major_ difference in content for ppt is explained by the duplication of header/footer info. To confirm this, note that the values for "num_unique_tokens_a" and "num_unique_tokens_b" are identical for nearly all ppt->ppt, but there are far more tokens in "num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but that's because we're now suppressing the style markup that our parser was (incorrectly, IMHO) inserting -- check the values in "top_10_unique_token_diffs_a", e.g.:

rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2

In short, I think we're good to go. Will roll rc1 later today or (more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison wrote:
> Any blockers on 1.20? I'm going to kick off the regression tests shortly.
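For anyone who wants to see why the token counts in the report message point to duplication rather than new content, here is a minimal sketch of that reasoning. It is not how the reports are actually generated: the function names and the whitespace tokenizer are assumptions made for illustration, and only the column names ("num_tokens_a", "num_unique_tokens_a", "top_10_unique_token_diffs_a", and their "_b" counterparts) come from the message. The idea is that if two extracts have the same number of unique tokens but different totals, the extra tokens are most likely repeats of text that was already there.

```python
from collections import Counter


def token_stats(text):
    """Return (num_tokens, num_unique_tokens) for whitespace-split text.

    Hypothetical helper: a real report would use a proper tokenizer.
    """
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


def looks_like_duplication(text_a, text_b):
    """Heuristic: same unique-token count, different total-token count.

    Equal unique counts do not prove the vocabularies are identical,
    so treat a True result as a hint worth checking by hand.
    """
    num_tokens_a, num_unique_a = token_stats(text_a)
    num_tokens_b, num_unique_b = token_stats(text_b)
    return num_unique_a == num_unique_b and num_tokens_a != num_tokens_b


def top_unique_token_diffs(text_a, text_b, n=10):
    """Most frequent tokens that occur in A but never in B."""
    counts_a = Counter(text_a.lower().split())
    vocab_b = set(text_b.lower().split())
    only_in_a = Counter({t: c for t, c in counts_a.items() if t not in vocab_b})
    return only_in_a.most_common(n)
```

Running `looks_like_duplication` on the two extracts of a ppt file and `top_unique_token_diffs` on the two extracts of a source-code file would reproduce, in miniature, the two checks described in the message.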
Re: 1.20?
Any blockers on 1.20? I'm going to kick off the regression tests shortly.

On Fri, Nov 30, 2018 at 7:39 PM wrote:
> Hi,
>
> On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> > Dave,
> > Should I try to get the Docker plugin working again?
>
> That would be great. I think I may have gone down the wrong path building
> an image at package time, as there doesn't seem to be an easy way to
> publish it as an Apache-labelled org on Docker Hub unless it builds from
> source.
>
> I have some time over the weekend, so I could update to where I got to and
> see what you think.
>
> Cheers,
> Dave
Re: 1.20?
Hi,

On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> Dave,
> Should I try to get the Docker plugin working again?

That would be great. I think I may have gone down the wrong path building an image at package time, as there doesn't seem to be an easy way to publish it as an Apache-labelled org on Docker Hub unless it builds from source.

I have some time over the weekend, so I could update to where I got to and see what you think.

Cheers,
Dave
Re: 1.20?
+1, it would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison wrote:
> All,
>
> POI 4.0.1 will be out shortly with some important bug fixes. What would
> you all think of targeting 1st/2nd week of December for 1.20?
>
> Cheers,
> Tim
Re: 1.20?
Dave,

Should I try to get the Docker plugin working again?

On Tue, Nov 20, 2018 at 6:21 PM Chris Mattmann wrote:
> Love it and I can align tika-python with that too ☺
Re: 1.20?
Love it and I can align tika-python with that too ☺

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "dev@tika.apache.org"
Subject: 1.20?

All,

POI 4.0.1 will be out shortly with some important bug fixes. What would you all think of targeting 1st/2nd week of December for 1.20?

Cheers,
Tim