Just following up here. It seems like we stalled out due to this issue. Per my previous comments, if we don't have a resolution for this, and it isn't regression, I propose moving forward without a change.
On Mon, Sep 12, 2016 at 12:57 PM, Piyush Narang <[email protected] > wrote: > Ok, think that sounds good to me. Shall take a look at your PR as well. > > On Mon, Sep 12, 2016 at 10:49 AM, Ryan Blue <[email protected]> > wrote: > > > Thanks, Piyush. I agree and would like to get the release out before we > > completely finish the changes for the sort order bug. But, I think it is > > important to do something to address the problem that stats pages for > > binary/UTF8 don't match Java's String sort order or UTF-8. That's why I > > posted a PR with minimal changes to disable returning stats using the > > signed order, unless a property is set to opt into using them. > > > > Alex is arguing that the current signed min/max shouldn't be considered a > > bug so we shouldn't suppress them in the 1.9.0 release. I think we should > > agree on whether this is a correctness issue and, if it is, fix it before > > the release. We can fix the sort order in 1.9.1. > > > > rb > > > > On Mon, Sep 12, 2016 at 10:43 AM, Piyush Narang > > <[email protected] > > > wrote: > > > > > Are there other projects / folks waiting on the 1.9.0 release? Given > that > > > it has been a while since our last release, it might make sense for us > to > > > first put that out (if I understand correctly, everything else has been > > > resolved). Seems like there's a bit of discussion on Andrew's PR and we > > > could follow up with another release once that is resolved. I'm worried > > > about the release goal post constantly moving and us adding more and > more > > > to the 1.9.0 release. > > > > > > On Sat, Sep 10, 2016 at 11:42 AM, Ryan Blue <[email protected] > > > > > wrote: > > > > > > > To keep the list up-to-date, there's more discussion on this topic on > > > > Andrew's original pull request [1] about whether we should consider > > this > > > a > > > > bug. I don't think we should build a release candidate until we agree > > > that > > > > it's a bug or not, and if it is a bug add a work-around to stop > > returning > > > > the bad min and max. If you're interested, please join the discussion > > on > > > > the PR. > > > > > > > > Thanks! > > > > > > > > rb > > > > > > > > > > > > [1]: https://github.com/apache/parquet-mr/pull/362 > > > > > > > > On Fri, Sep 9, 2016 at 2:50 PM, Ryan Blue <[email protected]> wrote: > > > > > > > > > I think what Spark needs to do depends on how we decide to fix > this. > > I > > > > > think the right thing to do for now is to treat these stats as > > invalid > > > > and > > > > > not return them, like we did for binary in PARQUET-251. That's an > > easy > > > > fix > > > > > because we've done it before and would just need to add another > check > > > > what > > > > > we did the last time. That way, Spark is automatically fixed when > it > > > > > updates to 1.9.0. We can work in 1.9.1 and later to fix how stats > are > > > > > written and agree on an update to the format. > > > > > > > > > > I just posted a pull request that implements this. It doesn't > > require a > > > > > parquet-format release and will fix the problem for downstream > > readers > > > by > > > > > not returning incorrect stats, based on the column type. In the > > patch, > > > > > stats are suppressed by default for unsigned integers, strings, > > enums, > > > > and > > > > > decimals. Readers can override this for string types by setting > > > > > parquet.strings.use-signed-order. > > > > > > > > > > Does that seem like a reasonable fix to get this release out? > > > > > > > > > > rb > > > > > > > > > > On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <[email protected]> > > > > wrote: > > > > > > > > > >> So I’m cool with making necessary changes to get this in sooner > > rather > > > > >> than later, I’ve mostly been blocking on code reviews. If there’s > a > > > > >> commitment made to releasing 1.9.1 as soon as the binary sort > order > > > > change > > > > >> goes in then I’m fine with not blocking on this for 1.9, but if > not > > > and > > > > we > > > > >> can expect the current velocity of releases then I’d rather see if > > we > > > > can > > > > >> get it in now. Especially because Spark and others will need to > wait > > > > until > > > > >> next release (>= 1 year from now?) to fix use of statistics for > > binary > > > > >> data. As Rob said, this is broken in master of Spark so without a > > > > >> fix+release of Parquet it’ll likely require disabling statistics > > > > pushdown > > > > >> for binary columns on their end. > > > > >> > > > > >> -Andrew > > > > >> > > > > >> On 9/8/16, 11:37 PM, "Jacques Nadeau" <[email protected]> wrote: > > > > >> > > > > >> >Absolutely agree the fix would be good to get in. > > > > >> > > > > > >> >My comments come from the fact that we tried to start the 1.9 > > release > > > > >> (late > > > > >> >last year I think) after the direct memory feature got in and it > > > never > > > > >> >completed (as there is always a good reason to hold the release). > > My > > > > main > > > > >> >fear is that if we wait for something and it turns into another > > > > >> substantial > > > > >> >delay before a release. Ryan, what do you think about proposing a > > > hard > > > > >> >deadline? > > > > >> > > > > > >> >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski < > > > > [email protected]> > > > > >> >wrote: > > > > >> > > > > > >> >> When would the fix be ready by to consider it for release in > > 1.9.0? > > > > >> This > > > > >> >> affects correctness on current master of spark and would be > good > > to > > > > >> get a > > > > >> >> release with it fixed so that people aren’t using potentially > bad > > > > >> versions. > > > > >> >> > > > > >> >> - Robert > > > > >> >> > > > > >> >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <[email protected]> > > wrote: > > > > >> >> > > > > >> >> A non-binding +1 from me on releasing sooner/more often. > > > > >> >> > > > > >> >> On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue > > > > >> <[email protected]> > > > > >> >> wrote: > > > > >> >> > > > > >> >> > Hey everyone, > > > > >> >> > > > > > >> >> > I'd like to put together a release candidate for 1.9.0. > The > > > > other > > > > >> >> issues > > > > >> >> > are done, but the sort order min/max issue, PARQUET-686 > is > > > > still > > > > >> >> open. > > > > >> >> > > > > > >> >> > I'm okay releasing 1.9.0 without fixing that issue since > > it's > > > > >> been > > > > >> >> so long > > > > >> >> > since our last release. It would also be nice to do > > releases > > > as > > > > >> >> necessary, > > > > >> >> > so we can always do a release to fix PARQUET-686 when the > > > patch > > > > >> for > > > > >> >> it is > > > > >> >> > ready. Is there anyone that thinks we should definitely > get > > > > this > > > > >> >> into the > > > > >> >> > 1.9.0 release? > > > > >> >> > > > > > >> >> > Thanks, > > > > >> >> > > > > > >> >> > rb > > > > >> >> > > > > > >> >> > -- > > > > >> >> > Ryan Blue > > > > >> >> > Software Engineer > > > > >> >> > Netflix > > > > >> >> > > > > > >> >> > > > > >> >> > > > > >> > > > > > > > > > > > > > > > > > > > > -- > > > > > Ryan Blue > > > > > Software Engineer > > > > > Netflix > > > > > > > > > > > > > > > > > > > > > -- > > > > Ryan Blue > > > > Software Engineer > > > > Netflix > > > > > > > > > > > > > > > > -- > > > - Piyush > > > > > > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > > > > > -- > - Piyush >
