A reply on your actual topic now. I think that following the implementation of the int96_as_timestamp option will result in the type conversion being done "deeply enough" for Drill. I sympathise a lot with the design thinking behind your second option, but I'd personally go the first route and only consider the second if something wasn't working.

On 2024/01/22 11:36, Peter Franzen wrote:
Hi,

I am using Drill to query Parquet files that have fields of type 
timestamp_micros. By default, Drill truncates those microsecond
values to milliseconds when reading the Parquet files in order to convert them 
to SQL timestamps.
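To make the lossiness concrete, here is a plain-Java illustration (not Drill code) of the truncation described above: dividing an epoch-microsecond value by 1000 to fit a millisecond timestamp discards the sub-millisecond part for good.

```java
// Plain Java illustration of the lossy truncation; the literal is the
// microsecond value used in the example query further down this mail.
public class MicrosTruncation {
    public static void main(String[] args) {
        long micros = 1705914906694751L;       // microseconds since the epoch
        long millis = micros / 1000L;          // what a millisecond timestamp keeps
        long restored = millis * 1000L;        // scaling back cannot recover the tail
        System.out.println(millis);            // 1705914906694
        System.out.println(micros - restored); // 751 microseconds lost
    }
}
```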

In some of my use cases I need to read the original microsecond values (as 
64-bit values, not SQL timestamps) through Drill, but
this doesn’t seem to be possible (unless I’ve missed something).

I have explored a possible solution to this, and would like to run it by some 
developers more experienced with the Drill code base
before I create a pull request.

My idea is to add two options, similar to 
“store.parquet.reader.int96_as_timestamp”, to control whether or not microsecond
times and timestamps are truncated to milliseconds. These options would be added to 
“org.apache.drill.exec.ExecConstants" and
"org.apache.drill.exec.server.options.SystemOptionManager", and to 
drill-module.conf:

     store.parquet.reader.time_micros_as_int64: false,
     store.parquet.reader.timestamp_micros_as_int64: false,

These options would then be used in the same places as 
“store.parquet.reader.int96_as_timestamp”:

org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the 
corresponding option is set to true.
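The selection logic the options would gate could be sketched roughly as below. This is a hypothetical, self-contained illustration, not actual Drill code; the real change would live in the three classes listed above, and none of these names exist in the code base.

```java
// Hypothetical sketch of the reader-selection branch the new options add.
// ReaderKind and the method names are illustrative only.
public class ReaderSelection {
    enum ReaderKind { TIMESTAMP_READER, TIME_READER, INT64_READER }

    // With the option enabled, hand back a plain int64 reader so the raw
    // microsecond value reaches the client untouched.
    static ReaderKind forTimestampMicros(boolean timestampMicrosAsInt64) {
        return timestampMicrosAsInt64 ? ReaderKind.INT64_READER
                                      : ReaderKind.TIMESTAMP_READER;
    }

    static ReaderKind forTimeMicros(boolean timeMicrosAsInt64) {
        return timeMicrosAsInt64 ? ReaderKind.INT64_READER
                                 : ReaderKind.TIME_READER;
    }

    public static void main(String[] args) {
        System.out.println(forTimestampMicros(true));  // INT64_READER
        System.out.println(forTimestampMicros(false)); // TIMESTAMP_READER
    }
}
```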

In addition to this, 
“org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be 
altered to _not_ truncate the min and max
values for time_micros/timestamp_micros if the corresponding option is true. 
This class doesn’t have a reference to an OptionManager, so
my guess is that the two new options must be extracted from the OptionManager 
when the ParquetReaderConfig instance is created.
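A self-contained sketch of why the min/max statistics must stay in microseconds when the option is enabled: if the stats were truncated to milliseconds while the filter literal stays in microseconds, a range check against row-group statistics could wrongly exclude a row group that actually contains the value. Illustrative arithmetic only, not Drill's pruning code.

```java
// Illustrative only: truncated stats vs. a microsecond filter value.
public class StatsTruncation {
    // A row group can match the filter only if min <= value <= max.
    static boolean mightContain(long min, long max, long value) {
        return min <= value && value <= max;
    }

    public static void main(String[] args) {
        long filterMicros = 1705914906694751L;  // value from the WHERE clause
        long minMicros = 1705914906694000L;
        long maxMicros = 1705914906694999L;     // row group really contains the value
        long minTruncated = minMicros / 1000L;  // 1705914906694 (millis)
        long maxTruncated = maxMicros / 1000L;  // 1705914906694 (millis)

        // true: correct decision with untruncated microsecond stats
        System.out.println(mightContain(minMicros, maxMicros, filterMicros));
        // false: the row group would be wrongly pruned with truncated stats
        System.out.println(mightContain(minTruncated, maxTruncated, filterMicros));
    }
}
```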

Filtering on microsecond columns would be done using 64-bit values rather than 
TIME/TIMESTAMP values, e.g.

select * from <file> where <timestamp_micros_column> = 1705914906694751;
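For building literals like the one above, standard java.time arithmetic converts between a human-readable instant and epoch microseconds (this is generic JDK code, independent of Drill):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Convert between java.time.Instant and microseconds since the epoch,
// e.g. to produce the 64-bit literal used in the WHERE clause above.
public class EpochMicros {
    static long toMicros(Instant instant) {
        return instant.getEpochSecond() * 1_000_000L + instant.getNano() / 1_000L;
    }

    static Instant fromMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        long micros = 1705914906694751L;
        System.out.println(fromMicros(micros));                      // 2024-01-22T09:15:06.694751Z
        System.out.println(toMicros(fromMicros(micros)) == micros);  // true: round-trip is exact
    }
}
```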

I’ve tested the solution outlined above, and it seems to work when using 
sqlline and with the JDBC driver, but not with the web-based interface.
Any pointers to the relevant code for that would be appreciated.

An alternative solution could be to intercept all reading of the 
Parquet schemas and modify the schema to report the
microsecond columns as int64 columns, i.e. to completely discard the 
information that the columns contain time/timestamp values.
This could make the code behave as expected even in parts where it is not
obvious that the time/timestamp properties of the columns are used.
However, this variant would not align with how INT96 timestamps are handled.

Any thoughts on this idea for how to access microsecond values would be highly 
appreciated.

Thanks,

/Peter

