To keep the list up-to-date, there's more discussion on this topic on
Andrew's original pull request [1] about whether we should consider this a
bug. I don't think we should build a release candidate until we agree that
it's a bug or not, and if it is a bug add a work-around to stop returning
the bad min and max. If you're interested, please join the discussion on
the PR.

Thanks!

rb


[1]: https://github.com/apache/parquet-mr/pull/362

On Fri, Sep 9, 2016 at 2:50 PM, Ryan Blue <[email protected]> wrote:

> I think what Spark needs to do depends on how we decide to fix this. I
> think the right thing to do for now is to treat these stats as invalid and
> not return them, like we did for binary in PARQUET-251. That's an easy fix
> because we've done it before and would just need to add another check what
> we did the last time. That way, Spark is automatically fixed when it
> updates to 1.9.0. We can work in 1.9.1 and later to fix how stats are
> written and agree on an update to the format.
>
> I just posted a pull request that implements this. It doesn't require a
> parquet-format release and will fix the problem for downstream readers by
> not returning incorrect stats, based on the column type. In the patch,
> stats are suppressed by default for unsigned integers, strings, enums, and
> decimals. Readers can override this for string types by setting
> parquet.strings.use-signed-order.
>
> Does that seem like a reasonable fix to get this release out?
>
> rb
>
> On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <[email protected]> wrote:
>
>> So I’m cool with making necessary changes to get this in sooner rather
>> than later, I’ve mostly been blocking on code reviews. If there’s a
>> commitment made to releasing 1.9.1 as soon as the binary sort order change
>> goes in then I’m fine with not blocking on this for 1.9, but if not and we
>> can expect the current velocity of releases then I’d rather see if we can
>> get it in now. Especially because Spark and others will need to wait until
>> next release (>= 1 year from now?) to fix use of statistics for binary
>> data. As Rob said, this is broken in master of Spark so without a
>> fix+release of Parquet it’ll likely require disabling statistics pushdown
>> for binary columns on their end.
>>
>> -Andrew
>>
>> On 9/8/16, 11:37 PM, "Jacques Nadeau" <[email protected]> wrote:
>>
>> >Absolutely agree the fix would be good to get in.
>> >
>> >My comments come from the fact that we tried to start the 1.9 release
>> (late
>> >last year I think) after the direct memory feature got in and it never
>> >completed (as there is always a good reason to hold the release). My main
>> >fear is that if we wait for something and it turns into another
>> substantial
>> >delay before a release. Ryan, what do you think about proposing a hard
>> >deadline?
>> >
>> >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski <[email protected]>
>> >wrote:
>> >
>> >> When would the fix be ready by to consider it for release in 1.9.0?
>> This
>> >> affects correctness on current master of spark and would be good to
>> get a
>> >> release with it fixed so that people aren’t using potentially bad
>> versions.
>> >>
>> >> -          Robert
>> >>
>> >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <[email protected]> wrote:
>> >>
>> >>     A non-binding +1 from me on releasing sooner/more often.
>> >>
>> >>     On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue
>> <[email protected]>
>> >> wrote:
>> >>
>> >>     > Hey everyone,
>> >>     >
>> >>     > I'd like to put together a release candidate for 1.9.0. The other
>> >> issues
>> >>     > are done, but the sort order min/max issue, PARQUET-686 is still
>> >> open.
>> >>     >
>> >>     > I'm okay releasing 1.9.0 without fixing that issue since it's
>> been
>> >> so long
>> >>     > since our last release. It would also be nice to do releases as
>> >> necessary,
>> >>     > so we can always do a release to fix PARQUET-686 when the patch
>> for
>> >> it is
>> >>     > ready. Is there anyone that thinks we should definitely get this
>> >> into the
>> >>     > 1.9.0 release?
>> >>     >
>> >>     > Thanks,
>> >>     >
>> >>     > rb
>> >>     >
>> >>     > --
>> >>     > Ryan Blue
>> >>     > Software Engineer
>> >>     > Netflix
>> >>     >
>> >>
>> >>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to