You'd have to convert int32 into a date type on your own, yes. Logical types aren't handled nicely at the moment, see https://issues.apache.org/jira/browse/PARQUET-1002

The correspondence between logical and physical types is specified in parquet.thrift; see the ConvertedType enum there.
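
As a rough illustration of the manual conversion, here is a minimal sketch. It assumes the DATE logical type is stored as an int32 count of days since 1970-01-01 and that UINT_64 keeps its bits in the INT64 physical type. The names `CivilDate`, `civil_from_days`, `as_uint64` and `to_iso_string` are hypothetical helpers made up for this example, not part of parquet-cpp. The sketch only reinterprets individual values; it does not address whether the writer compared values as signed or unsigned when it computed min and max.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Calendar date in the proleptic Gregorian calendar.
struct CivilDate {
    int year;
    unsigned month;  // 1..12
    unsigned day;    // 1..31
};

// Hypothetical helper (not a parquet-cpp API): convert a Parquet DATE value,
// assumed to be an int32 number of days since 1970-01-01, to year/month/day.
constexpr CivilDate civil_from_days(std::int32_t days_since_epoch) {
    // Shift the epoch to 0000-03-01 so that leap days fall at the end of a year.
    const std::int64_t z = static_cast<std::int64_t>(days_since_epoch) + 719468;
    const std::int64_t era = (z >= 0 ? z : z - 146096) / 146097;   // 400-year cycle
    const unsigned doe = static_cast<unsigned>(z - era * 146097);  // day of era
    const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);  // March-based day of year
    const unsigned mp = (5 * doy + 2) / 153;                       // month index, 0 = March
    const unsigned day = doy - (153 * mp + 2) / 5 + 1;
    const unsigned month = mp < 10 ? mp + 3 : mp - 9;
    const int year = static_cast<int>(static_cast<std::int64_t>(yoe) + era * 400)
                     + (month <= 2 ? 1 : 0);
    return CivilDate{year, month, day};
}

// Hypothetical helper: reinterpret an INT64 physical value as the unsigned
// value of a UINT_64 column (the conversion is modulo 2^64, so bits are kept).
constexpr std::uint64_t as_uint64(std::int64_t physical_value) {
    return static_cast<std::uint64_t>(physical_value);
}

// Format a date as YYYY-MM-DD.
inline std::string to_iso_string(const CivilDate& d) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%04d-%02u-%02u", d.year, d.month, d.day);
    return std::string(buf);
}
```

With typed statistics in hand, something like `to_iso_string(civil_from_days(stats->min()))` would give a readable minimum for a date column, where `stats` stands for an Int32 statistics object obtained via the cast described in the quoted reply.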
Artem

On 25.07.2017 17:10, Felipe Aramburu wrote:
> When I look at this I have one concern. How does this deal with different
> logical types? For example if I have a date column I am going to have to
> convert whatever is used as a physical type to represent date and convert
> that to a date type. Is this correct? Is there a place where I can see
> documentation on how these different values are represented in their
> different physical types so that I can ensure that I interpret the value of
> min and max properly from statistics?
>
> If I have a UINT 64 but that physical type is represented in an INT64 or
> INT96 (I am not sure which of these is used in this case) do I need to
> somehow modify the value of the physical type which is what will get
> returned by the TypedRowGroupStatistic?
>
> Thank you for your response. I was working late and must have had a
> brainfart since obviously encodingMin clearly states what it does :).
>
> Felipe
>
> On Tue, Jul 25, 2017 at 6:25 AM, Artem Tarasov <[email protected]> wrote:
>
>> Hi Felipe,
>>
>> Encode* functions do the opposite of what you want, encoding values of
>> any type to binary format to be stored on disk.
>> You need to call
>> static_pointer_cast<ByteArrayStatistics/Int32Statistics/etc.> and then
>> you'll have min()/max() methods to get the values.
>>
>>
>> Best,
>> Artem
>>
>>
>> On 25.07.2017 08:20, Felipe Aramburu wrote:
>>> I have included some code below that shows the context of where this is
>>> being retrieved but basically I am trying to do the following:
>>>
>>> std::shared_ptr<parquet::RowGroupStatistics> statistics =
>>>     columnMetaData->statistics();
>>> if(statistics->HasMinMax()){
>>>   minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
>>>   maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
>>>
>>>
>>> I look at the values in statistics->EncodeMin() and I am not exactly sure
>>> how to interpret them. What is the proper way for getting this value into
>>> an Int or Long or whatever C type represents the underlying data? What is
>>> the most concise way of retrieving the min and max values of every column
>>> in every row group inside of a parquet file?
>>>
>>> Any help is greatly appreciated.
>>>
>>> Felipe Aramburu
>>>
>>>
>>>
>>> for(int rowGroupIndex = 0; rowGroupIndex < num_row_groups;
>>>     rowGroupIndex++){
>>>   std::shared_ptr<parquet::RowGroupReader> groupReader =
>>>       parquet_reader->RowGroup(rowGroupIndex);
>>>   const parquet::RowGroupMetaData* rowGroupMetadata =
>>>       groupReader->metadata();
>>>   for(int blazingColumnIndex = 0; blazingColumnIndex <
>>>       blazingColumnToParquetColumn.size(); blazingColumnIndex++){
>>>     std::unique_ptr<parquet::ColumnChunkMetaData> columnMetaData =
>>>         rowGroupMetadata->ColumnChunk(
>>>             blazingColumnToParquetColumn[blazingColumnIndex]);
>>>     const parquet::ColumnDescriptor * column =
>>>         schema->Column(blazingColumnToParquetColumn[blazingColumnIndex]);
>>>
>>>     if(columnMetaData->is_stats_set()){
>>>       std::shared_ptr<parquet::RowGroupStatistics> statistics =
>>>           columnMetaData->statistics();
>>>       if(statistics->HasMinMax()){
>>>         minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
>>>         maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
>>>       }else{
>>>         //set min and max values
>>>         minStrings[blazingColumnIndex][rowGroupIndex] = "min";
>>>         maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
>>>       }
>>>     }else{
>>>       //set minData to value min and maxData to value max if no statistics exist
>>>       minStrings[blazingColumnIndex][rowGroupIndex] = "min";
>>>       maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
>>>     }
>>>   }
>>> }
>>>
