Re: Parquet min/max statistics & null values

2017-11-01 Thread Lars Volker
Hi Bruno,

Parquet-mr currently doesn't support the new stats fields:
https://issues.apache.org/jira/browse/PARQUET-1025

For development purposes I used a small Python script a while ago. I pushed
it here so you can check it out if you'd like, but it's really only meant
for development purposes:
https://github.com/lekv/impala/tree/dump_parquet_stats

Cheers, Lars


On Sat, Oct 28, 2017 at 1:19 PM, Bruno Quinart  wrote:

> Hi
>
> Thanks for your replies.
>
> An example query:
> SELECT * FROM reasonably_sized_partitioned_table
> WHERE partition_key_dtnr = 20171028
> AND sorted_column = 56789123456
> (same effect with an IN instead of the last equality)
>
> I checked the behavior in our cluster again and can now confirm Lars
> explanation:
> - Filtering on stats works well as from 2.9, except when there is only
> null values for the column in a parquet file
> - And that would be addressed by IMPALA-6113
> I was wrongly assuming that in some of my tables there would not have been
> enough null values to fill a parquet file for each partition.
>
> I wonder how you verify those stats in Parquet metadata. Parquet-tools
> currently does not print them.
>
> Thanks for the help
> Bruno
>
> On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:
>
> Hi Bruno,
>
> To clarify on your original observation: Column chunks that only have null
> values in them will not have their min_value and max_value fields populated
> and thus won't be skipped based on stats. I filed IMPALA-6113
>  to track this.
> IMPALA-5061  added
> support to populate the null_count in statistics, allowing us to detect
> column chunks that only contain NULLs. We should use that information to
> skip row groups if the predicate allows us to.
>
> Row groups with column chunks that have at least one non-null value should
> get filtered correctly.
>
> Cheers, Lars
>
> On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:
>
> Hi Bruno,
>
> Could you provide an example of the specific predicates that aren't being
>
> used to successfully skip the row group?
>
>
> - Tim
>
>
> On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:
>
>
> Hello Bruno,
>
>
> Thanks for bringing this up. While not apparent from the commit
>
> comments, this limitation was mentioned during the code review:
>
> 'min/max are only set when there are non-null values, so we don't
>
> consider statistics for "is null".' (see
>
> https://gerrit.cloudera.org/#/c/6147/).
>
> It looks to me that this was intended, but I'll let others confirm.
>
> Definitely a point where we can improve.
>
>
> Thanks!
>
>
> On 26 October 2017 at 08:02, Bruno Quinart wrote:
>
> > Hi all
>
> >
>
> > With IMPALA-2328, Parquet row group statistics are now being used to
>
> skip
>
> > the row group completely if the min/max range is excluded from the
>
> > predicate.
>
> > We have a use case in which we make sure the data is sorted on a 'key'
>
> and
>
> > have then many selective queries on that 'key' field. We notice a
>
> > significant performance increase.
>
> > So thanks a lot for all the work on that!
>
> >
>
> > One thing we notice is an unexpected behavior for records where that
>
> 'key'
>
> > has null values. It seems that as soon as null values are present in a
>
> row
>
> > group, the test on the min/max fails and the row group is read.
>
> >
>
> > We work with Impala 2.9. The data is put in parquet files by Impala
>
> itself.
>
> > We have noticed this effect for both bigint as decimal fields. Note that
>
> > it's difficult for me to extract the min/max statistics from the parquet
>
> > files. The parquet-tools included in our distribution (5.12) is not the
>
> > latest. And I was told PARQUET-327 would anyway not print the those row
>
> > group stats because of the way Impala stores them.
>
> > We do confirm the expected behavior (exactly one row group read for
>
> properly
>
> > sorted data) when we create a similar table but explicitly filter out
>
> all
>
> > null values for that 'key' field. We also notice that the the number of
>
> row
>
> > groups read (but zero records retained) is proportional to the number of
>
> > null values.
>
> >
>
> > Is this behavior expected?
>
> > Is there a fundamental reason those row groups can not be skipped?
>
> >
>
> > Thanks!
>
> > Bruno
>
> >
>
>
>
>
>


Re: Parquet min/max statistics & null values

2017-10-28 Thread Bruno Quinart

Hi

Thanks for your replies.

An example query:
SELECT * FROM reasonably_sized_partitioned_table
WHERE partition_key_dtnr = 20171028
AND sorted_column = 56789123456
(same effect with an IN instead of the last equality)

I checked the behavior in our cluster again and can now confirm Lars 
explanation:
- Filtering on stats works well as from 2.9, except when there is only null 
values for the column in a parquet file
- And that would be addressed by IMPALA-6113
I was wrongly assuming that in some of my tables there would not have been 
enough null values to fill a parquet file for each partition.

I wonder how you verify those stats in Parquet metadata. Parquet-tools 
currently does not print them.

Thanks for the help
Bruno

On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:

Hi Bruno,

To clarify on your original observation: Column chunks that only have null
values in them will not have their min_value and max_value fields populated
and thus won't be skipped based on stats. I filed IMPALA-6113
 to track this.
IMPALA-5061  added
support to populate the null_count in statistics, allowing us to detect
column chunks that only contain NULLs. We should use that information to
skip row groups if the predicate allows us to.

Row groups with column chunks that have at least one non-null value should
get filtered correctly.

Cheers, Lars

On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:

Hi Bruno,
Could you provide an example of the specific predicates that aren't being
used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:

Hello Bruno,

Thanks for bringing this up. While not apparent from the commit
comments, this limitation was mentioned during the code review:
'min/max are only set when there are non-null values, so we don't
consider statistics for "is null".' (see
https://gerrit.cloudera.org/#/c/6147/).
It looks to me that this was intended, but I'll let others confirm.
Definitely a point where we can improve.

Thanks!

On 26 October 2017 at 08:02, Bruno Quinart wrote:

Hi all

With IMPALA-2328, Parquet row group statistics are now being used to

skip

the row group completely if the min/max range is excluded from the
predicate.
We have a use case in which we make sure the data is sorted on a 'key'

and

have then many selective queries on that 'key' field. We notice a
significant performance increase.
So thanks a lot for all the work on that!

One thing we notice is an unexpected behavior for records where that

'key'

has null values. It seems that as soon as null values are present in a

row

group, the test on the min/max fails and the row group is read.

We work with Impala 2.9. The data is put in parquet files by Impala

itself.

We have noticed this effect for both bigint as decimal fields. Note that
it's difficult for me to extract the min/max statistics from the parquet
files. The parquet-tools included in our distribution (5.12) is not the
latest. And I was told PARQUET-327 would anyway not print the those row
group stats because of the way Impala stores them.
We do confirm the expected behavior (exactly one row group read for

properly

sorted data) when we create a similar table but explicitly filter out

all

null values for that 'key' field. We also notice that the the number of

row

groups read (but zero records retained) is proportional to the number of
null values.

Is this behavior expected?
Is there a fundamental reason those row groups can not be skipped?

Thanks!
Bruno







Re: Parquet min/max statistics & null values

2017-10-26 Thread Tim Armstrong
Hi Bruno,
 Could you provide an example of the specific predicates that aren't being
used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy  wrote:

> Hello Bruno,
>
> Thanks for bringing this up. While not apparent from the commit
> comments, this limitation was mentioned during the code review:
> 'min/max are only set when there are non-null values, so we don't
> consider statistics for "is null".' (see
> https://gerrit.cloudera.org/#/c/6147/).
> It looks to me that this was intended, but I'll let others confirm.
> Definitely a point where we can improve.
>
> Thanks!
>
> On 26 October 2017 at 08:02, Bruno Quinart  wrote:
> > Hi all
> >
> > With IMPALA-2328, Parquet row group statistics are now being used to skip
> > the row group completely if the min/max range is excluded from the
> > predicate.
> > We have a use case in which we make sure the data is sorted on a 'key'
> and
> > have then many selective queries on that 'key' field. We notice a
> > significant performance increase.
> > So thanks a lot for all the work on that!
> >
> > One thing we notice is an unexpected behavior for records where that
> 'key'
> > has null values. It seems that as soon as null values are present in a
> row
> > group, the test on the min/max fails and the row group is read.
> >
> > We work with Impala 2.9. The data is put in parquet files by Impala
> itself.
> > We have noticed this effect for both bigint as decimal fields. Note that
> > it's difficult for me to extract the min/max statistics from the parquet
> > files. The parquet-tools included in our distribution (5.12) is not the
> > latest. And I was told PARQUET-327 would anyway not print the those row
> > group stats because of the way Impala stores them.
> > We do confirm the expected behavior (exactly one row group read for
> properly
> > sorted data) when we create a similar table but explicitly filter out all
> > null values for that 'key' field. We also notice that the the number of
> row
> > groups read (but zero records retained) is proportional to the number of
> > null values.
> >
> > Is this behavior expected?
> > Is there a fundamental reason those row groups can not be skipped?
> >
> > Thanks!
> > Bruno
> >
>