Looking at this, I have one concern: how does it deal with different logical types? For example, if I have a date column, I will have to take whatever physical type is used to represent the date and convert it to a date type. Is this correct? Is there documentation on how the different logical types are represented in their physical types, so that I can be sure I interpret the min and max values from the statistics properly?
If I have a UINT64 column whose physical type is INT64 or INT96 (I am not sure which of these is used in this case), do I need to somehow modify the value of the physical type, which is what will get returned by the TypedRowGroupStatistics?

Thank you for your response. I was working late and must have had a brainfart, since EncodeMin clearly states what it does :).

Felipe

On Tue, Jul 25, 2017 at 6:25 AM, Artem Tarasov <[email protected]> wrote:

> Hi Felipe,
>
> Encode* functions do the opposite of what you want, encoding values of
> any type to binary format to be stored on disk.
> You need to call
> static_pointer_cast<ByteArrayStatistics/Int32Statistics/etc.> and then
> you'll have min()/max() methods to get the values.
>
>
> Best,
> Artem
>
>
> On 25.07.2017 08:20, Felipe Aramburu wrote:
> > I have included some code below that shows the context of where this is
> > being retrieved but basically I am trying to do the following:
> >
> > std::shared_ptr<parquet::RowGroupStatistics> statistics = columnMetaData->statistics();
> > if(statistics->HasMinMax()){
> >   minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >   maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >
> >
> > I look at the values in statistics->EncodeMin() and I am not exactly sure
> > how to interpret them. What is the proper way for getting this value into
> > an Int or Long or whatever C type represents the underlying data? What is
> > the most concise way of retrieving the min and max values of every column
> > in every row group inside of a parquet file?
> >
> > Any help is greatly appreciated.
> >
> > Felipe Aramburu
> >
> >
> >
> > for(int rowGroupIndex = 0; rowGroupIndex < num_row_groups; rowGroupIndex++){
> >   std::shared_ptr<parquet::RowGroupReader> groupReader = parquet_reader->RowGroup(rowGroupIndex);
> >   const parquet::RowGroupMetaData* rowGroupMetadata = groupReader->metadata();
> >   for(int blazingColumnIndex = 0; blazingColumnIndex < blazingColumnToParquetColumn.size(); blazingColumnIndex++){
> >     std::unique_ptr<parquet::ColumnChunkMetaData> columnMetaData = rowGroupMetadata->ColumnChunk(blazingColumnToParquetColumn[blazingColumnIndex]);
> >     const parquet::ColumnDescriptor * column = schema->Column(blazingColumnToParquetColumn[blazingColumnIndex]);
> >
> >     if(columnMetaData->is_stats_set()){
> >       std::shared_ptr<parquet::RowGroupStatistics> statistics = columnMetaData->statistics();
> >       if(statistics->HasMinMax()){
> >         minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >         maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >       }else{
> >         //set min and max max values
> >         minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >         maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >       }
> >     }else{
> >       //set minData to value min and maxData to value max if not statistics exists
> >       minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >       maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >     }
> >   }
> > }
> >
> >
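
On the logical-type question at the top of this thread, here is a minimal, self-contained sketch of how physical min/max values could be interpreted. It relies on what the Parquet format specification describes: DATE annotates an INT32 holding days since 1970-01-01, UINT_64 annotates an INT64 whose bits are read as unsigned, and the encoded statistics bytes of an INT64 column are the plain (little-endian) encoding of the value. The names CivilDate, DateFromDays, AsUint64 and DecodePlainInt64 are hypothetical helpers made up for illustration; they are not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Calendar date as year / month / day (hypothetical helper type).
struct CivilDate {
    int year;
    unsigned month;  // 1..12
    unsigned day;    // 1..31
};

// Convert days since 1970-01-01 (assumed layout of a DATE value stored in an
// INT32 column) to a proleptic Gregorian date, using the widely used
// days-to-civil algorithm.
CivilDate DateFromDays(int32_t days_since_epoch) {
    const int64_t z = static_cast<int64_t>(days_since_epoch) + 719468;
    const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
    const unsigned doe = static_cast<unsigned>(z - era * 146097);  // day of era
    const unsigned yoe =
        (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;     // year of era
    const int64_t y = static_cast<int64_t>(yoe) + era * 400;
    const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);  // day of year
    const unsigned mp = (5 * doy + 2) / 153;                       // March-based month
    const unsigned d = doy - (153 * mp + 2) / 5 + 1;
    const unsigned m = mp < 10 ? mp + 3 : mp - 9;
    return CivilDate{static_cast<int>(y + (m <= 2 ? 1 : 0)), m, d};
}

// Reinterpret the signed physical value of an INT64 column as the unsigned
// logical value (assumes the column is annotated UINT_64).
uint64_t AsUint64(int64_t physical) {
    return static_cast<uint64_t>(physical);
}

// Decode the 8 raw bytes of a plain-encoded INT64 (assumed little-endian),
// e.g. the kind of string returned by EncodeMin() for an INT64 column.
int64_t DecodePlainInt64(const std::string& bytes) {
    assert(bytes.size() == sizeof(int64_t));
    uint64_t bits = 0;
    for (std::size_t i = 0; i < sizeof(int64_t); ++i) {
        bits |= static_cast<uint64_t>(static_cast<unsigned char>(bytes[i])) << (8 * i);
    }
    return static_cast<int64_t>(bits);
}
```

With helpers like these, a min() of 17372 read from the Int32Statistics of a DATE column would map to 2017-07-25, and an INT64 value of -1 on a UINT_64 column would map to 18446744073709551615.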
