Re: Parquet, usage of statistics in DataPageHeader

Hyukjin Kwon Thu, 17 Sep 2015 23:07:44 -0700

Just in case, what I meant by skipping pages (or row groups) with statistics
is filtering them out by comparing the value given to filter2 against the
statistics (min, max, etc.) in DataPageHeader and ColumnMetaData.
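
As a rough illustration of that comparison, here is a minimal sketch. All
names in it (StatsSkipSketch, MinMax, canDropEq and so on) are hypothetical
and are not parquet-mr APIs; it assumes a long column, ignores nulls, and
assumes min and max are both present in the header.

```java
// Hypothetical sketch of statistics-based skipping; not parquet-mr code.
final class StatsSkipSketch {

    /** Min/max pair as it might be read from a page or column chunk header. */
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** "column == value" can only match if min <= value <= max. */
    static boolean canDropEq(MinMax stats, long value) {
        return value < stats.min || value > stats.max;
    }

    /** "column > value" can only match if max > value. */
    static boolean canDropGt(MinMax stats, long value) {
        return stats.max <= value;
    }

    /** "column < value" can only match if min < value. */
    static boolean canDropLt(MinMax stats, long value) {
        return stats.min >= value;
    }

    public static void main(String[] args) {
        MinMax page = new MinMax(10, 20);           // assumed page statistics
        System.out.println(canDropEq(page, 5));     // true: 5 is below min
        System.out.println(canDropEq(page, 15));    // false: 15 may be present
        System.out.println(canDropGt(page, 20));    // true: nothing exceeds 20
        System.out.println(canDropLt(page, 10));    // true: nothing is below 10
    }
}
```

A "true" result means the page (or row group) cannot contain a matching
value, so it could be skipped before decompression and decoding. A "false"
result only means a match is possible, so the data still has to be read.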


Thanks!

2015-09-18 14:58 GMT+09:00 Hyukjin Kwon <[email protected]>:

> I see.
>
> However, does filtering at RowMaterializer (with
> IncrementallyUpdatedFilterPredicate as filter2) actually happen after
> reading the values for a row of the pages (in the columns of the row)?
>
> I just wonder whether some pages could be skipped using the statistics in
> DataPageHeader before actually reading the data part of the pages, in
> order to reduce the cost of I/O, decompression and decoding, just like
> row groups are skipped using the statistics in ColumnMetaData (in a
> split) before actually starting to read a Parquet file.
>
>
> Although I may well be wrong, I did find, for example, the
> ColumnChunkPageReadStore.ColumnChunkPageReader.readPage() function, which
> reads the actual page data:
>
>
> public DataPage visit(DataPageV2 dataPageV2) {
>
>   if (!dataPageV2.isCompressed()) {
>     return dataPageV2;
>   }
>   try {
>     int uncompressedSize = Ints.checkedCast(
>         dataPageV2.getUncompressedSize()
>         - dataPageV2.getDefinitionLevels().size()
>         - dataPageV2.getRepetitionLevels().size());
>     return DataPageV2.uncompressed(
>         dataPageV2.getRowCount(),
>         dataPageV2.getNullCount(),
>         dataPageV2.getValueCount(),
>         dataPageV2.getRepetitionLevels(),
>         dataPageV2.getDefinitionLevels(),
>         dataPageV2.getDataEncoding(),
>         // decompression happens here
>         decompressor.decompress(dataPageV2.getData(), uncompressedSize),
>         dataPageV2.getStatistics()
>         );
>   } catch (IOException e) {
>     throw new ParquetDecodingException("could not decompress page", e);
>   }
> }
>
>
> I think we could actually skip the page here, without decompressing and
> decoding it, by filtering on the given filter value and the statistics in
> DataPageHeader.
>
> Is there any existing logic for this kind of skipping?
>
>
> Thanks!
>
>
>
> 2015-09-18 2:31 GMT+09:00 Ryan Blue <[email protected]>:
>
>> Hi Hyukjin,
>>
>> I think the code you're looking for is created by parquet-generator so we
>> have one specific to each primitive type:
>>
>>
>>
>> https://github.com/apache/parquet-mr/blob/master/parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
>>
>> rb
>>
>>
>> On 09/16/2015 06:57 PM, Hyukjin Kwon wrote:
>>
>>> Hi all,
>>>
>>> I am pretty new to Parquet and trying to learn Parquet structure.
>>>
>>> I assume that min, max, etc. have been stored in both ColumnMetaData
>>> and DataPageHeader since 1.6.0 (
>>> https://github.com/Parquet/parquet-mr/pull/338)
>>>
>>> I see that the statistics in ColumnMetaData are used to filter blocks (or
>>> row groups) for filter2 in RowGroupFilter by calling canDrop().
>>>
>>> I thought the statistics in DataPageHeader were used to avoid reading a
>>> page.
>>> However, I could not find where the statistics in DataPageHeader are used
>>> for either filter1 or filter2.
>>> 
>>>
>>> Could you give me some comments on this please?
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
