SQL allows timestamps to be stored with any precision (i.e. number of digits 
after the decimal point) between 0 and 9. That strongly indicates to me that 
the right implementation of timestamps is as (fixed point) decimal values.

Then devote your efforts to getting the decimal type working correctly.


> On Jun 27, 2016, at 3:16 PM, Wes McKinney <[email protected]> wrote:
> 
> hi Uwe,
> 
> Thanks for bringing this up. So far we've largely been skirting the
> "Logical Types Rabbit Hole", but it would be good to start a document
> collecting requirements for various logical types (e.g. timestamps) so
> that we can attempt to achieve good solutions on the first try based
> on the experiences (good and bad) of other projects.
> 
> In the IPC flatbuffers metadata spec that we drafted for discussion /
> prototype implementation earlier this year [1], we do have a Timestamp
> logical type containing only a timezone optional field [2]. If you
> contrast this with Feather (which uses Arrow's physical memory layout,
> but custom metadata to suit Python/R needs), that has both a unit and
> timezone [3].
> 
> Since there is little consensus in the units of timestamps (more
> consensus around the UNIX 1970-01-01 epoch, but not even 100%
> uniformity), I believe the best route would be to add a unit to the
> metadata to indicates second through nanosecond resolution. Same goes
> for a Time type.
> 
> For example, Parquet has both milliseconds and microseconds (in
> Parquet 2.0). But earlier versions of Parquet don't have this at all
> [4]. Other systems like Hive and Impala are relying on their own table
> metadata to convert back and forth (e.g. embedding timestamps of
> whatever resolution in int64 or int96).
> 
> For Python pandas that want to use Parquet files (via Arrow) in their
> workflow, we're stuck with a couple options:
> 
> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
> (or MILLIS? Not all Parquet readers may be aware of the new
> microsecond ConvertedType)
> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
> ColumnMetaData::key_value_metadata (it's better than nothing?).
> 
> I see use cases for both of these -- for Option 1, you may care about
> interoperability with another system that uses Parquet. For Option 2,
> you may care about preserving the fidelity of your pandas data.
> Realistically, #1 seems like the best default option. It makes sense
> to offer #2 as an option.
> 
> I don't think addressing time zones in the first pass is strictly
> necessary, but as long as we store timestamps as UTC, we can also put
> the time zone in the KeyValue metadata.
> 
> I'm not sure about the Interval type -- let's create a JIRA and tackle
> that in a separate discussion. I agree that it merits inclusion as a
> logical type, but I'm not sure what storage representation makes the
> most sense (e.g. is is not clear to me why Parquet does not store the
> interval as an absolute number of milliseconds; perhaps to accommodate
> month-based intervals which may have different absolute lengths
> depending on where you start).
> 
> Let me know what you think, and if others have thoughts I'd be interested too.
> 
> thanks,
> Wes
> 
> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
> [2] : https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
> [3]: 
> https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
> [4]: 
> https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
> 
> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <[email protected]> wrote:
>> Hello,
>> 
>> in addition to categoricals, we also miss at the moment a conversion from
>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two (exact)
>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes there
>> are several more. We do not need to cater for all but at least some of them.
>> Therefore I have the following questions which I like to have solved in some
>> form before implementing:
>> 
>> * Do we want to cater for other resolutions?
>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>   in Pandas), do we cast with precision loss to the nearest match? Or
>>   should we force the user to do it?
>> * Not so important for me at the moment: Do we want to support time zones?
>> 
>> My current objective is to have them for Parquet file writing. Sadly this
>> has the same limitations. So the two main options seem to be
>> 
>> * "roundtrip will only yield correct timezone and logical type if we
>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>>   encode it)"
>> * "we restrict us to milliseconds and days as resolution" (for the
>>   latter option, we need to decide how graceful we want to be in the
>>   Pandas<->Arrow conversion).
>> 
>> Further datatype we have not yet in Arrow but partly in Parquet is timedelta
>> (or INTERVAL in Parquet). Probably we need to add another logical type to
>> Arrow to implement them. Open for suggestions here, too.
>> 
>> Also in the Arrow spec there is TIME which seems to be the same as TIMESTAMP
>> (as far as the comments in the C++ code goes). Is there maybe some
>> distinction I'm missing?
>> 
>> Cheers
>> 
>> Uwe
>> 

Reply via email to