Hi,

Thanks for the feedback. The first variant seems to work, so I’ll go with that.
/Peter

> On 24 Jan 2024, at 05:20, James Turton <dz...@apache.org> wrote:
>
> A reply on your actual topic now. I think that following implementation of
> the int96_as_timestamp option will result in the type conversion being done
> "deeply enough" for Drill. I sympathise a lot with the design thinking in
> your second option but I'd personally go the first route and only consider
> the second route if something wasn't working.
>
> On 2024/01/22 11:36, Peter Franzen wrote:
>> Hi,
>>
>> I am using Drill to query Parquet files that have fields of type
>> timestamp_micros. By default, Drill truncates those microsecond values to
>> milliseconds when reading the Parquet files in order to convert them to
>> SQL timestamps.
>>
>> In some of my use cases I need to read the original microsecond values (as
>> 64-bit values, not SQL timestamps) through Drill, but this doesn’t seem to
>> be possible (unless I’ve missed something).
>>
>> I have explored a possible solution to this, and would like to run it by
>> some developers more experienced with the Drill code base before I create
>> a pull request.
>>
>> My idea is to add two options similar to
>> "store.parquet.reader.int96_as_timestamp" to control whether or not
>> microsecond times and timestamps are truncated to milliseconds.
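For illustration of the truncation being discussed: dividing a Parquet timestamp_micros value down to milliseconds discards the last three digits irreversibly, so no post-processing of Drill's SQL timestamp can recover them. A minimal stdlib Java sketch:

```java
public class MicrosTruncationDemo {
    public static void main(String[] args) {
        // A Parquet timestamp_micros value (microseconds since the epoch).
        long micros = 1_705_914_906_694_751L;

        // Converting to an SQL timestamp in milliseconds drops the
        // sub-millisecond digits...
        long millis = micros / 1_000L;

        // ...and scaling back up cannot recover them.
        long restored = millis * 1_000L;

        System.out.println(micros - restored); // prints 751
    }
}
```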
>> These options would be added to "org.apache.drill.exec.ExecConstants" and
>> "org.apache.drill.exec.server.options.SystemOptionManager", and to
>> drill-module.conf:
>>
>> store.parquet.reader.time_micros_as_int64: false,
>> store.parquet.reader.timestamp_micros_as_int64: false,
>>
>> These options would then be used in the same places as
>> "store.parquet.reader.int96_as_timestamp":
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
>> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>>
>> to create an int64 reader instead of a time/timestamp reader when the
>> corresponding option is set to true.
>>
>> In addition to this,
>> "org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector" must
>> be altered to _not_ truncate the min and max values for
>> time_micros/timestamp_micros if the corresponding option is true. This
>> class doesn’t have a reference to an OptionManager, so my guess is that
>> the two new options must be extracted from the OptionManager when the
>> ParquetReaderConfig instance is created.
>>
>> Filtering on microsecond columns would be done using 64-bit values rather
>> than TIME/TIMESTAMP values, e.g.
>>
>> select * from <file> where <timestamp_micros_column> = 1705914906694751;
>>
>> I’ve tested the solution outlined above, and it seems to work with sqlline
>> and with the JDBC driver, but not with the web-based interface. Any
>> pointers to the relevant code for that would be appreciated.
>>
>> An alternative solution to the above could be to intercept all reading of
>> the Parquet schemas and modify the schema to report the microsecond
>> columns as int64 columns, i.e. to completely discard the information that
>> the columns contain time/timestamp values.
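As a usage note on the 64-bit filtering style quoted above: with such an option enabled, a client would compute the int64 filter literal from a wall-clock time itself. A small stdlib java.time sketch (the instant chosen here reproduces the literal used in the example query):

```java
import java.time.Instant;

public class EpochMicros {
    public static void main(String[] args) {
        // Convert a wall-clock instant to microseconds since the epoch,
        // suitable as a 64-bit literal in a WHERE clause.
        Instant ts = Instant.parse("2024-01-22T09:15:06.694751Z");
        long epochMicros = ts.getEpochSecond() * 1_000_000L
                         + ts.getNano() / 1_000L;
        System.out.println(epochMicros); // prints 1705914906694751
    }
}
```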
>> This could potentially make parts of the code behave as expected even
>> where it is not obvious that the time/timestamp properties of the columns
>> are used. However, this variant would not align with how INT96 timestamps
>> are handled.
>>
>> Any thoughts on this idea for how to access microsecond values would be
>> highly appreciated.
>>
>> Thanks,
>>
>> /Peter
>>
>