You'd have to convert int32 into date type on your own, yes.
Logical types aren't handled nicely at the moment, see
https://issues.apache.org/jira/browse/PARQUET-1002
Correspondence between logical and physical types is specified in
parquet.thrift, see ConvertedType enum there.
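
A minimal sketch of that manual conversion, assuming the column is
annotated as DATE (an int32 count of days since 1970-01-01 in the Parquet
format) or as UINT_64 (stored in the INT64 physical type). The names
CivilDate, DateFromParquetDays, FormatDate and AsUint64 are hypothetical
helpers, not part of parquet-cpp. The day arithmetic is the commonly used
"civil from days" algorithm by Howard Hinnant, not something the library
provides.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper type: a proleptic Gregorian calendar date.
struct CivilDate {
  int year;
  unsigned month;  // 1..12
  unsigned day;    // 1..31
};

// Converts a Parquet DATE value (days since 1970-01-01) to year/month/day.
// The computation shifts the epoch to 0000-03-01 so that leap days fall at
// the end of each (shifted) year.
CivilDate DateFromParquetDays(int32_t days) {
  const int64_t z = static_cast<int64_t>(days) + 719468;
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;  // 400-year cycles
  const unsigned doe = static_cast<unsigned>(z - era * 146097);  // day of era
  const unsigned yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;  // year of era
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);  // day of year
  const unsigned mp = (5 * doy + 2) / 153;  // month index, 0 = March
  const unsigned day = doy - (153 * mp + 2) / 5 + 1;
  const unsigned month = mp < 10 ? mp + 3 : mp - 9;
  const int64_t year = static_cast<int64_t>(yoe) + era * 400 + (month <= 2);
  return {static_cast<int>(year), month, day};
}

// Formats a CivilDate as YYYY-MM-DD.
std::string FormatDate(const CivilDate& d) {
  char buf[32];
  std::snprintf(buf, sizeof buf, "%04d-%02u-%02u", d.year, d.month, d.day);
  return std::string(buf);
}

// Reinterprets an INT64 physical value as the unsigned value of a UINT_64
// column. Signed-to-unsigned conversion is well defined (modulo 2^64).
uint64_t AsUint64(int64_t physical) {
  return static_cast<uint64_t>(physical);
}
```

With these helpers, FormatDate(DateFromParquetDays(17372)) gives
"2017-07-25", and AsUint64(-1) gives 18446744073709551615.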

Artem


On 25.07.2017 17:10, Felipe Aramburu wrote:
> When I look at this I have one concern. How does this deal with different
> logical types? For example if I have a date column I am going to have to
> convert whatever is used as a physical type to represent date and convert
> that to a date type. Is this correct? Is there a place where I can see
> documentation on how these different values are represented in their
> different physical types so that I can ensure that I interpret the value of
> min and max properly from statistics?
>
> If I have a UINT_64 column whose physical type is INT64 or INT96 (I am not
> sure which of these is used in this case), do I need to somehow modify the
> physical value that TypedRowGroupStatistics returns?
>
> Thank you for your response. I was working late and must have had a
> brainfart since obviously encodingMin clearly states what it does :).
>
> Felipe
>
> On Tue, Jul 25, 2017 at 6:25 AM, Artem Tarasov wrote:
>
>> Hi Felipe,
>>
>> Encode* functions do the opposite of what you want, encoding values of
>> any type to binary format to be stored on disk.
>> You need to call
>> static_pointer_cast<ByteArrayStatistics/Int32Statistics/etc.> and then
>> you'll have min()/max() methods to get the values.
>>
>>
>> Best,
>> Artem
>>
>>
>> On 25.07.2017 08:20, Felipe Aramburu wrote:
>>> I have included some code below that shows the context of where this is
>>> being retrieved but basically I am trying to do the following:
>>>
>>> std::shared_ptr<parquet::RowGroupStatistics> statistics =
>>>     columnMetaData->statistics();
>>> if (statistics->HasMinMax()) {
>>>   minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
>>>   maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
>>> }
>>>
>>>
>>> I look at the values in statistics->EncodeMin() and I am not exactly sure
>>> how to interpret them. What is the proper way for getting this value into
>>> an Int or Long or whatever C type represents the underlying data? What is
>>> the most concise way of retrieving the min and max values of every column
>>> in every row group inside of a parquet file?
>>>
>>> Any help is greatly appreciated.
>>>
>>> Felipe Aramburu
>>>
>>>
>>>
>>> for (int rowGroupIndex = 0; rowGroupIndex < num_row_groups;
>>>      rowGroupIndex++) {
>>>   std::shared_ptr<parquet::RowGroupReader> groupReader =
>>>       parquet_reader->RowGroup(rowGroupIndex);
>>>   const parquet::RowGroupMetaData* rowGroupMetadata =
>>>       groupReader->metadata();
>>>   for (int blazingColumnIndex = 0;
>>>        blazingColumnIndex < blazingColumnToParquetColumn.size();
>>>        blazingColumnIndex++) {
>>>     std::unique_ptr<parquet::ColumnChunkMetaData> columnMetaData =
>>>         rowGroupMetadata->ColumnChunk(
>>>             blazingColumnToParquetColumn[blazingColumnIndex]);
>>>     const parquet::ColumnDescriptor* column =
>>>         schema->Column(blazingColumnToParquetColumn[blazingColumnIndex]);
>>>
>>>     if (columnMetaData->is_stats_set()) {
>>>       std::shared_ptr<parquet::RowGroupStatistics> statistics =
>>>           columnMetaData->statistics();
>>>       if (statistics->HasMinMax()) {
>>>         minStrings[blazingColumnIndex][rowGroupIndex] =
>>>             statistics->EncodeMin();
>>>         maxStrings[blazingColumnIndex][rowGroupIndex] =
>>>             statistics->EncodeMax();
>>>       } else {
>>>         // No min/max recorded: store placeholder values.
>>>         minStrings[blazingColumnIndex][rowGroupIndex] = "min";
>>>         maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
>>>       }
>>>     } else {
>>>       // No statistics exist for this column chunk: store placeholders.
>>>       minStrings[blazingColumnIndex][rowGroupIndex] = "min";
>>>       maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
>>>     }
>>>   }
>>> }
>>>
>>
