Re: Question about timestamps ...

David Boles Thu, 10 Oct 2019 11:01:31 -0700

Joris,

Thank you for the response. There's such a trail of stale information
online on the overall topic that it wasn't clear what the status was. For
example, simple searches take you into the "INT96 is deprecated, therefore
support for nanoseconds is as well" cul-de-sac. Absent that confusing
context, the existing error message is fine.


It's worth noting that accurate and precise timestamps down to ~0.1
nanosecond are widely available, with 0.02ns being available for just a few
thousand $US.

I'll stick with usec resolution for absolute time and just use an int64
field for my nanosecond data.
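
A minimal sketch of that workaround, using only the standard library. The
helper name split_ns and the idea of keeping the full nanosecond count in
the int64 field (rather than only the sub-microsecond remainder) are
assumptions for illustration; the first value would go in a
timestamp('us', tz='UTC') column and the second in a plain int64 column.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_ns(ns_since_epoch: int) -> Tuple[datetime, int]:
    """Hypothetical helper: derive a microsecond-resolution UTC timestamp
    from a nanosecond count, and keep the original count unchanged."""
    # Floor division drops the sub-microsecond digits (rounds toward -inf).
    us_since_epoch = ns_since_epoch // 1_000
    ts_us = EPOCH + timedelta(microseconds=us_since_epoch)
    # The untouched nanosecond count is what would be stored as int64.
    return ts_us, ns_since_epoch


# Same sample value as in the original report.
ts_us, raw_ns = split_ns(1234567893141)
print(ts_us.isoformat(), raw_ns)
```

The microsecond column stays readable by tools that only understand
'ms'/'us', while the int64 column preserves the full precision.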

Thanks again.

 - db

On Thu, Oct 10, 2019 at 5:11 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi David,
>
> This is intentional, see
> https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
> some explanation in the documentation. Basically, the parquet format only
> supports ms and us resolution, and so nanosecond timestamps (which are
> supported by Arrow) are converted to one of those resolutions.
>
> We could maybe clarify that better in the error message (something like
> "only 'ms' and 'us' are supported") ?
>
> In the latest version of the parquet format specification, there is
> actually support for nanosecond resolution as well (
>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype
> ).
> You can obtain this by specifying version="2.0", but the implementation is
> not yet fully ready (see https://issues.apache.org/jira/browse/PARQUET-458
> ),
> and also not all frameworks support this version (so if compatibility
> across processing frameworks is important, it is recommended to stick with
> version 1).
>
> Joris
>
> On Wed, 9 Oct 2019 at 21:27, David Boles <bibliobo...@gmail.com> wrote:
>
> > The following code dies with pyarrow 0.14.2:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
> > writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')
> >
> > ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
> > tz='UTC'))
> > table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])
> >
> > writer.write_table(table)
> > writer.close()
> >
> > with the message:
> >
> > ValueError: Invalid value for coerce_timestamps: ns
> >
> > That appears to be because of this code in _parquet.pxi:
> >
> >     cdef int _set_coerce_timestamps(
> >             self, ArrowWriterProperties.Builder* props) except -1:
> >         if self.coerce_timestamps == 'ms':
> >             props.coerce_timestamps(TimeUnit_MILLI)
> >         elif self.coerce_timestamps == 'us':
> >             props.coerce_timestamps(TimeUnit_MICRO)
> >         elif self.coerce_timestamps is not None:
> >             raise ValueError('Invalid value for coerce_timestamps: {0}'
> >                              .format(self.coerce_timestamps))
> >
> > which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
> > else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
> > intentional, or a bug?
> >
> > Thanks,
> >
> >  - db
> >
>
