I think what Spark needs to do depends on how we decide to fix this. I think the right thing to do for now is to treat these stats as invalid and not return them, like we did for binary in PARQUET-251. That's an easy fix because we've done it before and would just need to add another check what we did the last time. That way, Spark is automatically fixed when it updates to 1.9.0. We can work in 1.9.1 and later to fix how stats are written and agree on an update to the format.
I just posted a pull request that implements this. It doesn't require a parquet-format release and will fix the problem for downstream readers by not returning incorrect stats, based on the column type. In the patch, stats are suppressed by default for unsigned integers, strings, enums, and decimals. Readers can override this for string types by setting parquet.strings.use-signed-order. Does that seem like a reasonable fix to get this release out? rb On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <[email protected]> wrote: > So I’m cool with making necessary changes to get this in sooner rather > than later, I’ve mostly been blocking on code reviews. If there’s a > commitment made to releasing 1.9.1 as soon as the binary sort order change > goes in then I’m fine with not blocking on this for 1.9, but if not and we > can expect the current velocity of releases then I’d rather see if we can > get it in now. Especially because Spark and others will need to wait until > next release (>= 1 year from now?) to fix use of statistics for binary > data. As Rob said, this is broken in master of Spark so without a > fix+release of Parquet it’ll likely require disabling statistics pushdown > for binary columns on their end. > > -Andrew > > On 9/8/16, 11:37 PM, "Jacques Nadeau" <[email protected]> wrote: > > >Absolutely agree the fix would be good to get in. > > > >My comments come from the fact that we tried to start the 1.9 release > (late > >last year I think) after the direct memory feature got in and it never > >completed (as there is always a good reason to hold the release). My main > >fear is that if we wait for something and it turns into another > substantial > >delay before a release. Ryan, what do you think about proposing a hard > >deadline? > > > >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski <[email protected]> > >wrote: > > > >> When would the fix be ready by to consider it for release in 1.9.0? This > >> affects correctness on current master of spark and would be good to get > a > >> release with it fixed so that people aren’t using potentially bad > versions. > >> > >> - Robert > >> > >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <[email protected]> wrote: > >> > >> A non-binding +1 from me on releasing sooner/more often. > >> > >> On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue <[email protected] > > > >> wrote: > >> > >> > Hey everyone, > >> > > >> > I'd like to put together a release candidate for 1.9.0. The other > >> issues > >> > are done, but the sort order min/max issue, PARQUET-686 is still > >> open. > >> > > >> > I'm okay releasing 1.9.0 without fixing that issue since it's been > >> so long > >> > since our last release. It would also be nice to do releases as > >> necessary, > >> > so we can always do a release to fix PARQUET-686 when the patch > for > >> it is > >> > ready. Is there anyone that thinks we should definitely get this > >> into the > >> > 1.9.0 release? > >> > > >> > Thanks, > >> > > >> > rb > >> > > >> > -- > >> > Ryan Blue > >> > Software Engineer > >> > Netflix > >> > > >> > >> > -- Ryan Blue Software Engineer Netflix
