To keep the list up-to-date, there's more discussion on this topic on Andrew's original pull request [1] about whether we should consider this a bug. I don't think we should build a release candidate until we agree that it's a bug or not, and if it is a bug add a work-around to stop returning the bad min and max. If you're interested, please join the discussion on the PR.
Thanks! rb [1]: https://github.com/apache/parquet-mr/pull/362 On Fri, Sep 9, 2016 at 2:50 PM, Ryan Blue <[email protected]> wrote: > I think what Spark needs to do depends on how we decide to fix this. I > think the right thing to do for now is to treat these stats as invalid and > not return them, like we did for binary in PARQUET-251. That's an easy fix > because we've done it before and would just need to add another check what > we did the last time. That way, Spark is automatically fixed when it > updates to 1.9.0. We can work in 1.9.1 and later to fix how stats are > written and agree on an update to the format. > > I just posted a pull request that implements this. It doesn't require a > parquet-format release and will fix the problem for downstream readers by > not returning incorrect stats, based on the column type. In the patch, > stats are suppressed by default for unsigned integers, strings, enums, and > decimals. Readers can override this for string types by setting > parquet.strings.use-signed-order. > > Does that seem like a reasonable fix to get this release out? > > rb > > On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <[email protected]> wrote: > >> So I’m cool with making necessary changes to get this in sooner rather >> than later, I’ve mostly been blocking on code reviews. If there’s a >> commitment made to releasing 1.9.1 as soon as the binary sort order change >> goes in then I’m fine with not blocking on this for 1.9, but if not and we >> can expect the current velocity of releases then I’d rather see if we can >> get it in now. Especially because Spark and others will need to wait until >> next release (>= 1 year from now?) to fix use of statistics for binary >> data. As Rob said, this is broken in master of Spark so without a >> fix+release of Parquet it’ll likely require disabling statistics pushdown >> for binary columns on their end. >> >> -Andrew >> >> On 9/8/16, 11:37 PM, "Jacques Nadeau" <[email protected]> wrote: >> >> >Absolutely agree the fix would be good to get in. >> > >> >My comments come from the fact that we tried to start the 1.9 release >> (late >> >last year I think) after the direct memory feature got in and it never >> >completed (as there is always a good reason to hold the release). My main >> >fear is that if we wait for something and it turns into another >> substantial >> >delay before a release. Ryan, what do you think about proposing a hard >> >deadline? >> > >> >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski <[email protected]> >> >wrote: >> > >> >> When would the fix be ready by to consider it for release in 1.9.0? >> This >> >> affects correctness on current master of spark and would be good to >> get a >> >> release with it fixed so that people aren’t using potentially bad >> versions. >> >> >> >> - Robert >> >> >> >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <[email protected]> wrote: >> >> >> >> A non-binding +1 from me on releasing sooner/more often. >> >> >> >> On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue >> <[email protected]> >> >> wrote: >> >> >> >> > Hey everyone, >> >> > >> >> > I'd like to put together a release candidate for 1.9.0. The other >> >> issues >> >> > are done, but the sort order min/max issue, PARQUET-686 is still >> >> open. >> >> > >> >> > I'm okay releasing 1.9.0 without fixing that issue since it's >> been >> >> so long >> >> > since our last release. It would also be nice to do releases as >> >> necessary, >> >> > so we can always do a release to fix PARQUET-686 when the patch >> for >> >> it is >> >> > ready. Is there anyone that thinks we should definitely get this >> >> into the >> >> > 1.9.0 release? >> >> > >> >> > Thanks, >> >> > >> >> > rb >> >> > >> >> > -- >> >> > Ryan Blue >> >> > Software Engineer >> >> > Netflix >> >> > >> >> >> >> >> > > > > -- > Ryan Blue > Software Engineer > Netflix > -- Ryan Blue Software Engineer Netflix
