Does Parquet support page-level filtering as well as rowgroup-level filtering?
2015-09-18 15:06 GMT+09:00 Hyukjin Kwon <[email protected]>:

> Just in case, what I meant by skipping pages (or row groups) with statistics
> is filtering them by comparing the given filter2 value against the
> statistics, such as min and max, in DataPageHeader and
> ColumnMetaData.
>
> Thanks!
>
> 2015-09-18 14:58 GMT+09:00 Hyukjin Kwon <[email protected]>:
>
>> I see.
>>
>> However, does filtering at RowMaterializer (with
>> IncrementallyUpdatedFilterPredicate as filter2) actually happen after
>> reading the values for a row of the pages (in the columns of the row)?
>>
>> I just wonder if some pages can be skipped by the statistics in
>> DataPageHeader before actually reading the data part of the pages, in
>> order to reduce the cost of I/O, decompression, and decoding,
>> just like skipping row groups by the statistics in ColumnMetaData (in a
>> split) before actually starting to read a Parquet file.
>>
>> I may well be wrong, but, for example, I could find the
>> ColumnChunkPageReadStore.ColumnChunkPageReader.readPage() function, which
>> reads the actual page data:
>>
>> public DataPage visit(DataPageV2 dataPageV2) {
>>   if (!dataPageV2.isCompressed()) {
>>     return dataPageV2;
>>   }
>>   try {
>>     int uncompressedSize = Ints.checkedCast(
>>         dataPageV2.getUncompressedSize()
>>         - dataPageV2.getDefinitionLevels().size()
>>         - dataPageV2.getRepetitionLevels().size());
>>     return DataPageV2.uncompressed(
>>         dataPageV2.getRowCount(),
>>         dataPageV2.getNullCount(),
>>         dataPageV2.getValueCount(),
>>         dataPageV2.getRepetitionLevels(),
>>         dataPageV2.getDefinitionLevels(),
>>         dataPageV2.getDataEncoding(),
>>         *decompressor.decompress(dataPageV2.getData(), uncompressedSize),*
>>         dataPageV2.getStatistics()
>>     );
>>   } catch (IOException e) {
>>     throw new ParquetDecodingException("could not decompress page", e);
>>   }
>> }
>>
>> I think we could actually skip the page here, without decompressing and
>> decoding it, by comparing the given filter value with the statistics in
>> DataPageHeader.
>>
>> Is there any existing logic for this kind of skipping?
>>
>> Thanks!
>>
>> 2015-09-18 2:31 GMT+09:00 Ryan Blue <[email protected]>:
>>
>>> Hi Hyukjin,
>>>
>>> I think the code you're looking for is created by parquet-generator so
>>> we have one specific to each primitive type:
>>>
>>> https://github.com/apache/parquet-mr/blob/master/parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
>>>
>>> rb
>>>
>>> On 09/16/2015 06:57 PM, Hyukjin Kwon wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am pretty new to Parquet and trying to learn the Parquet structure.
>>>>
>>>> I assume that min, max, and similar statistics have been stored in both
>>>> ColumnMetaData and DataPageHeader since 1.6.0 (
>>>> https://github.com/Parquet/parquet-mr/pull/338)
>>>>
>>>> I see that the statistics in ColumnMetaData are used to filter blocks (or
>>>> row groups) for filter2 at RowGroupFilter by calling canDrop().
>>>>
>>>> I thought the statistics in DataPageHeader were used in the same way, to
>>>> avoid reading a page by checking its statistics first.
>>>> However, I could not find where the statistics in DataPageHeader are
>>>> used, for either filter1 or filter2.
>>>>
>>>> Could you give me some comments on this please?
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
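
To make the question concrete, here is a minimal sketch of the kind of page-level check being asked about: decide from a page's min/max alone whether the page can be skipped before it is decompressed. This is only an illustration and not parquet-mr code. PageStats, Op, and pagesToRead are hypothetical names made up for this sketch, canDrop here is a simplified stand-in rather than the filter2 implementation, and it assumes a long column with no nulls.

```java
import java.util.ArrayList;
import java.util.List;

public class PageSkipSketch {

    /** Hypothetical stand-in for the min/max statistics kept in a page header. */
    static final class PageStats {
        final long min;
        final long max;

        PageStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Hypothetical predicate kinds; this is not the parquet-mr filter2 API. */
    enum Op { EQ, LT, GT }

    /** Returns true when no value in [min, max] can satisfy "column OP value". */
    static boolean canDrop(PageStats s, Op op, long value) {
        switch (op) {
            case EQ: return value < s.min || value > s.max;
            case LT: return s.min >= value;
            case GT: return s.max <= value;
            default: return false; // unknown predicate: never drop
        }
    }

    /** Indexes of the pages that would still need to be decompressed and decoded. */
    static List<Integer> pagesToRead(List<PageStats> pages, Op op, long value) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (!canDrop(pages.get(i), op, value)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<PageStats> pages = new ArrayList<>();
        pages.add(new PageStats(0, 9));
        pages.add(new PageStats(10, 19));
        pages.add(new PageStats(20, 29));

        System.out.println(pagesToRead(pages, Op.EQ, 15)); // prints [1]
        System.out.println(pagesToRead(pages, Op.GT, 12)); // prints [1, 2]
    }
}
```

The check mirrors what is described above for row groups, only applied per page, so a dropped page would never reach the decompress call.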
