[
https://issues.apache.org/jira/browse/ARROW-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua Pedrick closed ARROW-7678.
---------------------------------
Resolution: Invalid
Managed to reproduce the bug without setting TZ, so the TZ setting is not the cause.
> [C++][Parquet] setting TZ= in environment on Linux causes broken parquet
> ------------------------------------------------------------------------
>
> Key: ARROW-7678
> URL: https://issues.apache.org/jira/browse/ARROW-7678
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.15.1
> Environment: Linux, Ubuntu 18.04, arrow/parquet 0.15.1 from
> instructions https://arrow.apache.org/install/
> Reporter: Joshua Pedrick
> Priority: Blocker
>
> When I set TZ=CST-8, or another timezone, on Linux, the time columns in my
> resulting Parquet file are corrupted.
>
> Below are the calls I use to define my schema:
>
> {code:java}
> // Timestamp column; the first argument is is_adjusted_to_utc.
> PrimitiveNode::Make( columnName, Repetition::REQUIRED,
>     LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ),
>     ::parquet::Type::INT64 );
> // Time-of-day column; the first argument is is_adjusted_to_utc.
> PrimitiveNode::Make( columnName, repetition,
>     LogicalType::Time( true, LogicalType::TimeUnit::MICROS ),
>     ::parquet::Type::INT64 );
> {code}
> I use an Int64Writer for both column types. When reading the file back (here
> with pandas and pyarrow, but the same happens in C++), I get the following exception:
> {code:java}
> File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
> Deserializing page header failed.
> {code}
> It seems as if the column header is defining a timestamp with a timezone even
> though I explicitly set is_adjusted_to_utc.
>
>