Hi,

Thanks for the feedback. The first variant seems to work, so I’ll go with that.

/Peter

> On 24 Jan 2024, at 05:20, James Turton <dz...@apache.org> wrote:
> 
> A reply on your actual topic now. I think that following implementation of 
> the int96_as_timestamp option will result in the type conversion being done 
> "deeply enough" for Drill. I sympathise a lot with the design thinking in 
> your second option but I'd personally go the first route and only consider 
> the second route if something wasn't working.
> 
> On 2024/01/22 11:36, Peter Franzen wrote:
>> Hi,
>> 
>> I am using Drill to query Parquet files that have fields of type 
>> timestamp_micros. By default, Drill truncates those microsecond
>> values to milliseconds when reading the Parquet files in order to convert 
>> them to SQL timestamps.
>> 
>> In some of my use cases I need to read the original microsecond values (as 
>> 64-bit values, not SQL timestamps) through Drill, but
>> this doesn’t seem to be possible (unless I’ve missed something).
>> 
>> I have explored a possible solution to this, and would like to run it by 
>> some developers more experienced with the Drill code base
>> before I create a pull request.
>> 
>> My idea is to add two options, similar to
>> "store.parquet.reader.int96_as_timestamp", to control whether or not
>> microsecond times and timestamps are truncated to milliseconds. These
>> options would be added to "org.apache.drill.exec.ExecConstants" and
>> "org.apache.drill.exec.server.options.SystemOptionManager", and to
>> drill-module.conf:
>> 
>>     store.parquet.reader.time_micros_as_int64: false,
>>     store.parquet.reader.timestamp_micros_as_int64: false,
>> 
>> These options would then be used in the same places as 
>> “store.parquet.reader.int96_as_timestamp”:
>> 
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
>> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>> 
>> to create an int64 reader instead of a time/timestamp reader when the 
>> corresponding option is set to true.
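>> As a rough sketch (the type and method names below are simplified
>> stand-ins, not Drill's actual converter API), the selection would look
>> something like:

```java
// Hypothetical sketch of the proposed type selection; DrillType and the
// method names are stand-ins, not Drill's real ParquetToDrillTypeConverter.
public class MicrosTypeSelection {

    enum DrillType { BIGINT, TIMESTAMP, TIME }

    // Pick the Drill type for a Parquet TIMESTAMP(MICROS) column.
    static DrillType forTimestampMicros(boolean timestampMicrosAsInt64) {
        // With the proposed option enabled, expose the raw 64-bit value.
        return timestampMicrosAsInt64 ? DrillType.BIGINT : DrillType.TIMESTAMP;
    }

    // Pick the Drill type for a Parquet TIME(MICROS) column.
    static DrillType forTimeMicros(boolean timeMicrosAsInt64) {
        return timeMicrosAsInt64 ? DrillType.BIGINT : DrillType.TIME;
    }

    public static void main(String[] args) {
        System.out.println(forTimestampMicros(false)); // TIMESTAMP (default)
        System.out.println(forTimestampMicros(true));  // BIGINT
        System.out.println(forTimeMicros(true));       // BIGINT
    }
}
```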
>> 
>> In addition to this, 
>> “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be 
>> altered to _not_ truncate the min and max
>> values for time_micros/timestamp_micros if the corresponding option is true. 
>> This class doesn’t have a reference to an OptionManager, so
>> my guess is that the two new options must be extracted from the 
>> OptionManager when the ParquetReaderConfig instance is created.
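>> Keeping the full min/max values matters because truncation to
>> milliseconds is lossy and cannot be undone, for example:

```java
// Demonstrates why millisecond truncation cannot be undone for a
// microsecond timestamp: the sub-millisecond digits are discarded.
public class TruncationLoss {
    public static void main(String[] args) {
        long micros = 1705914906694751L;       // a timestamp_micros value
        long millis = micros / 1000L;          // truncated to milliseconds
        long restored = millis * 1000L;        // scaled back: 751 us are gone
        System.out.println(micros - restored); // prints 751
    }
}
```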
>> 
>> Filtering on microsecond columns would be done using 64-bit values rather 
>> than TIME/TIMESTAMP values, e.g.
>> 
>> select * from <file> where <timestamp_micros_column> = 1705914906694751;
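>> The 64-bit literal is simply microseconds since the Unix epoch, so a
>> client can convert a returned value back to a readable timestamp, e.g.:

```java
import java.time.Instant;

// Converts a Parquet timestamp_micros value (non-negative microseconds
// since the Unix epoch) to a java.time.Instant on the client side.
public class MicrosToInstant {
    static Instant fromMicros(long micros) {
        return Instant.ofEpochSecond(micros / 1_000_000L,
                                     (micros % 1_000_000L) * 1_000L);
    }

    public static void main(String[] args) {
        // The literal from the query above.
        System.out.println(fromMicros(1705914906694751L));
        // prints 2024-01-22T09:15:06.694751Z
    }
}
```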
>> 
>> I’ve tested the solution outlined above, and it seems to work with 
>> sqlline and with the JDBC driver, but not with the web-based interface.
>> Any pointers to the relevant code for that would be appreciated.
>> 
>> An alternative solution to the above could be to intercept all reading of 
>> the Parquet schemas and modify the schema to report the
>> microsecond columns as int64 columns, i.e. to completely discard the 
>> information that the columns contain time/timestamp values.
>> This could potentially make parts of the code behave as expected even 
>> where it is not obvious that the time/timestamp properties of the columns 
>> are used. However, this variant would not align with how INT96 timestamps 
>> are handled.
>> 
>> Any thoughts on this idea for how to access microsecond values would be 
>> highly appreciated.
>> 
>> Thanks,
>> 
>> /Peter
>> 
> 
