So I see descriptions for some of the logical types here
https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift

But how are unsigned values converted to the signed representation? Are you
just shifting?

So if I have UINT64 as the logical type in Parquet and the physical type
is INT64, is the INT64 value that is stored in the statistics converted to
UINT64 by adding min_possible_value_of(INT64)?
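
To make the question concrete, here is a minimal sketch of the two readings I
can think of. The helper names are hypothetical and not part of parquet-cpp.
My understanding, which is an assumption and not something confirmed in this
thread, is that the unsigned logical types keep the same 64 bits and only the
interpretation changes (the first function), rather than applying an offset
(the second). The last helper assumes the encoded statistic is the plain
8-byte little-endian form of the value, which I have not verified either.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reading 1 (hypothetical helper): keep the 64 bits and read them as unsigned.
// Converting int64_t to uint64_t is defined as reduction modulo 2^64, so the
// bit pattern is preserved on two's complement machines.
inline uint64_t ReinterpretAsUnsigned(int64_t physical) {
  return static_cast<uint64_t>(physical);
}

// Reading 2 (hypothetical helper): an offset shift that maps INT64_MIN to 0.
// Adding 2^63 modulo 2^64 is the same as flipping the top bit.
inline uint64_t OffsetAsUnsigned(int64_t physical) {
  return static_cast<uint64_t>(physical) + (uint64_t{1} << 63);
}

// Hypothetical helper: decode an 8-byte string as a little-endian unsigned
// value. ASSUMPTION: the encoded min/max is the plain little-endian encoding.
inline uint64_t DecodeUInt64Stat(const std::string& encoded) {
  assert(encoded.size() == 8);
  uint64_t bits = 0;
  for (int i = 7; i >= 0; --i) {
    bits = (bits << 8) | static_cast<unsigned char>(encoded[i]);
  }
  return bits;
}
```

For a stored INT64 of -1, the first reading gives 18446744073709551615 and the
second gives 9223372036854775807, so the two are not interchangeable. Whether
the writer computed min and max using unsigned ordering is a separate question
that this sketch does not address.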

Thanks again for your help.

On Tue, Jul 25, 2017 at 10:35 AM, Artem Tarasov <[email protected]>
wrote:

> You'd have to convert int32 into date type on your own, yes.
> Logical types aren't handled nicely at the moment, see
> https://issues.apache.org/jira/browse/PARQUET-1002
> Correspondence between logical and physical types is specified in
> parquet.thrift, see ConvertedType enum there.
>
> Artem
>
>
> On 25.07.2017 17:10, Felipe Aramburu wrote:
> > When I look at this I have one concern. How does this deal with different
> > logical types? For example, if I have a date column I am going to have to
> > take whatever physical type is used to represent the date and convert
> > it to a date type. Is this correct? Is there a place where I can see
> > documentation on how these different values are represented in their
> > physical types, so that I can ensure that I interpret the min and max
> > values from the statistics properly?
> >
> > If I have a UINT64 but the physical type is an INT64 or
> > INT96 (I am not sure which of these is used in this case), do I need to
> > somehow modify the value of the physical type, which is what will get
> > returned by the TypedRowGroupStatistic?
> >
> > Thank you for your response. I was working late and must have had a
> > brainfart since obviously encodingMin clearly states what it does :).
> >
> > Felipe
> >
> > On Tue, Jul 25, 2017 at 6:25 AM, Artem Tarasov <[email protected]> wrote:
> >
> >> Hi Felipe,
> >>
> >> Encode* functions do the opposite of what you want, encoding values of
> >> any type to binary format to be stored on disk.
> >> You need to call
> >> static_pointer_cast<ByteArrayStatistics/Int32Statistics/etc.> and then
> >> you'll have min()/max() methods to get the values.
> >>
> >>
> >> Best,
> >> Artem
> >>
> >>
> >> On 25.07.2017 08:20, Felipe Aramburu wrote:
> >>> I have included some code below that shows the context of where this is
> >>> being retrieved but basically I am trying to do the following:
> >>>
> >>> std::shared_ptr<parquet::RowGroupStatistics> statistics =
> >>>     columnMetaData->statistics();
> >>> if (statistics->HasMinMax()) {
> >>>   minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >>>   maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >>> }
> >>>
> >>>
> >>> I look at the values in statistics->EncodeMin() and I am not exactly sure
> >>> how to interpret them. What is the proper way of getting this value into
> >>> an int or long or whatever C type represents the underlying data? What is
> >>> the most concise way of retrieving the min and max values of every column
> >>> in every row group inside of a Parquet file?
> >>>
> >>> Any help is greatly appreciated.
> >>>
> >>> Felipe Aramburu
> >>>
> >>>
> >>>
> >>> for (int rowGroupIndex = 0; rowGroupIndex < num_row_groups; rowGroupIndex++) {
> >>>   std::shared_ptr<parquet::RowGroupReader> groupReader =
> >>>       parquet_reader->RowGroup(rowGroupIndex);
> >>>   const parquet::RowGroupMetaData* rowGroupMetadata = groupReader->metadata();
> >>>   for (int blazingColumnIndex = 0;
> >>>        blazingColumnIndex < blazingColumnToParquetColumn.size();
> >>>        blazingColumnIndex++) {
> >>>     std::unique_ptr<parquet::ColumnChunkMetaData> columnMetaData =
> >>>         rowGroupMetadata->ColumnChunk(
> >>>             blazingColumnToParquetColumn[blazingColumnIndex]);
> >>>     const parquet::ColumnDescriptor* column =
> >>>         schema->Column(blazingColumnToParquetColumn[blazingColumnIndex]);
> >>>
> >>>     if (columnMetaData->is_stats_set()) {
> >>>       std::shared_ptr<parquet::RowGroupStatistics> statistics =
> >>>           columnMetaData->statistics();
> >>>       if (statistics->HasMinMax()) {
> >>>         minStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMin();
> >>>         maxStrings[blazingColumnIndex][rowGroupIndex] = statistics->EncodeMax();
> >>>       } else {
> >>>         // No min/max recorded: store placeholder values.
> >>>         minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >>>         maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >>>       }
> >>>     } else {
> >>>       // No statistics exist: store placeholder values.
> >>>       minStrings[blazingColumnIndex][rowGroupIndex] = "min";
> >>>       maxStrings[blazingColumnIndex][rowGroupIndex] = "max";
> >>>     }
> >>>   }
> >>> }
> >>>
> >>
>
>
