So I see descriptions for some of the logical types here:
https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift

But in terms of converting unsigned to signed representation, are you just
shifting? So if I have a UINT64 in Parquet as the logical type and the
physical type is INT64, is the binary representation in INT64 that is stored
in statistics converted to UINT64 by adding min_possible_value_of(INT64)?

Thanks for your help again.

On Tue, Jul 25, 2017 at 10:35 AM, Artem Tarasov <[email protected]> wrote:

> You'd have to convert int32 into date type on your own, yes.
> Logical types aren't handled nicely at the moment, see
> https://issues.apache.org/jira/browse/PARQUET-1002
> Correspondence between logical and physical types is specified in
> parquet.thrift, see ConvertedType enum there.
>
> Artem
>
>
> On 25.07.2017 17:10, Felipe Aramburu wrote:
> > When I look at this I have one concern. How does this deal with different
> > logical types? For example if I have a date column I am going to have to
> > convert whatever is used as a physical type to represent date and convert
> > that to a date type. Is this correct? Is there a place where I can see
> > documentation on how these different values are represented in their
> > different physical types so that I can ensure that I interpret the value
> > of min and max properly from statistics.
> >
> > If I have a UINT 64 but that physical type is represented in an INT64 or
> > INT96 (I am not sure which of these is used in this case) do I need to
> > somehow modify the value of the physical type which is what will get
> > returned by the TypedRowGroupStatistic?
> >
> > Thank you for your response. I was working late and must have had a
> > brainfart since obviously encodingMin clearly states what it does :).
> >
> > Felipe
> >
> > On Tue, Jul 25, 2017 at 6:25 AM, Artem Tarasov <[email protected]> wrote:
> >
> >> Hi Felipe,
> >>
> >> Encode* functions do the opposite of what you want, encoding values of
> >> any type to binary format to be stored on disk.
> >> You need to call
> >> static_pointer_cast<ByteArrayStatistics/Int32Statistics/etc.> and then
> >> you'll have min()/max() methods to get the values.
> >>
> >>
> >> Best,
> >> Artem
> >>
> >>
> >> On 25.07.2017 08:20, Felipe Aramburu wrote:
> >>> I have included some code below that shows the context of where this is
> >>> being retrieved but basically I am trying to do the following:
> >>>
> >>> std::shared_ptr<parquet::RowGroupStatistics> statistics =
> >>>     columnMetaData->statistics();
> >>> if(statistics->HasMinMax()){
> >>>   minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >>>   maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >>>
> >>>
> >>> I look at the values in statistics->EncodeMin() and I am not exactly sure
> >>> how to interpret them. What is the proper way for getting this value into
> >>> an Int or Long or whatever C type represents the underlying data? What is
> >>> the most concise way of retrieving the min and max values of every column
> >>> in every row group inside of a parquet file?
> >>>
> >>> Any help is greatly appreciated.
> >>>
> >>> Felipe Aramburu
> >>>
> >>>
> >>>
> >>> for(int rowGroupIndex = 0; rowGroupIndex < num_row_groups; rowGroupIndex++){
> >>>   std::shared_ptr<parquet::RowGroupReader> groupReader =
> >>>       parquet_reader->RowGroup(rowGroupIndex);
> >>>   const parquet::RowGroupMetaData* rowGroupMetadata = groupReader->metadata();
> >>>   for(int blazingColumnIndex = 0; blazingColumnIndex <
> >>>       blazingColumnToParquetColumn.size(); blazingColumnIndex++){
> >>>     std::unique_ptr<parquet::ColumnChunkMetaData> columnMetaData =
> >>>         rowGroupMetadata->ColumnChunk(blazingColumnToParquetColumn[blazingColumnIndex]);
> >>>     const parquet::ColumnDescriptor * column =
> >>>         schema->Column(blazingColumnToParquetColumn[blazingColumnIndex]);
> >>>
> >>>     if(columnMetaData->is_stats_set()){
> >>>       std::shared_ptr<parquet::RowGroupStatistics> statistics =
> >>>           columnMetaData->statistics();
> >>>       if(statistics->HasMinMax()){
> >>>         minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >>>         maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >>>       }else{
> >>>         //set min and max max values
> >>>         minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >>>         maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >>>       }
> >>>     }else{
> >>>       //set minData to value min and maxData to value max if not statistics exists
> >>>       minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >>>       maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >>>     }
> >>>   }
> >>> }
> >>>
> >>
> >
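
To make the UINT64 question at the top of this message concrete, here is a small standalone sketch of the two candidate conversions. It does not use parquet-cpp at all; the helper names ReinterpretAsUnsigned, ShiftByInt64Min and CheckExamples are hypothetical and only for illustration. My reading of the format description is that a UINT64 column keeps the same 64 bits in its INT64 physical value (the first helper, no offset added), but nothing in this thread confirms that, so please treat it as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not part of parquet-cpp): treat the INT64 physical
// value as the same 64-bit pattern, read as unsigned. No arithmetic offset.
inline std::uint64_t ReinterpretAsUnsigned(std::int64_t physical) {
  std::uint64_t logical = 0;
  std::memcpy(&logical, &physical, sizeof(logical));  // bit-for-bit copy
  return logical;
}

// Hypothetical helper: the "shifting" alternative from the question, i.e.
// adding min_possible_value_of(INT64). Modulo 2^64 this is the same as
// adding 2^63, which flips the top bit.
inline std::uint64_t ShiftByInt64Min(std::int64_t physical) {
  return static_cast<std::uint64_t>(physical) + (std::uint64_t{1} << 63);
}

// Shows where the two interpretations disagree.
inline void CheckExamples() {
  const std::int64_t minus_one = -1;
  // Same bits: -1 reads back as the largest unsigned value.
  assert(ReinterpretAsUnsigned(minus_one) ==
         std::numeric_limits<std::uint64_t>::max());
  // Shifted: -1 lands just below the midpoint of the unsigned range.
  assert(ShiftByInt64Min(minus_one) == 0x7FFFFFFFFFFFFFFFULL);
  // Shifted: zero maps to 2^63, so the two schemes differ for every input.
  assert(ShiftByInt64Min(0) == (std::uint64_t{1} << 63));
}
```

If the same-bits reading is right, a min or max that was computed by comparing the values as signed integers would not necessarily be the unsigned min or max, so that is something I would also want to check before trusting the statistics for unsigned columns.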
