Hi David,

This is intentional; see
https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
an explanation in the documentation. Basically, the Parquet format only
supports ms and us resolution, so nanosecond timestamps (which Arrow
supports) are converted to one of those resolutions.

We could maybe make that clearer in the error message (something like
"only 'ms' and 'us' are supported")?

In the latest version of the Parquet format specification, there is
actually support for nanosecond resolution as well (see
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype).
You can enable this by specifying version="2.0" when writing, but the
implementation is not yet fully ready (see
https://issues.apache.org/jira/browse/PARQUET-458), and not all frameworks
support this version. So if compatibility across processing frameworks is
important, it is recommended to stick with version 1.

Joris

On Wed, 9 Oct 2019 at 21:27, David Boles <bibliobo...@gmail.com> wrote:

> The following code dies with pyarrow 0.14.2:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
> writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')
>
> ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
> tz='UTC'))
> table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])
>
> writer.write_table(table)
> writer.close()
>
> with the message:
>
> ValueError: Invalid value for coerce_timestamps: ns
>
> That appears to be because of this code in _parquet.pxi:
>
>     cdef int _set_coerce_timestamps(
>             self, ArrowWriterProperties.Builder* props) except -1:
>         if self.coerce_timestamps == 'ms':
>             props.coerce_timestamps(TimeUnit_MILLI)
>         elif self.coerce_timestamps == 'us':
>             props.coerce_timestamps(TimeUnit_MICRO)
>         elif self.coerce_timestamps is not None:
>             raise ValueError('Invalid value for coerce_timestamps: {0}'
>                              .format(self.coerce_timestamps))
>
> which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
> else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
> intentional, or a bug?
>
> Thanks,
>
>  - db
>
