Paul Bormans created ARROW-12644:
------------------------------------
Summary: Can't read from parquet partitioned by date/time (Spark)
Key: ARROW-12644
URL: https://issues.apache.org/jira/browse/ARROW-12644
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Paul Bormans
I'm using Spark (3.1.1) to write a dataframe to a partitioned parquet dataset
(using delta.io) which is partitioned by a timestamp field.
The relevant Spark code:
{code:python}
# sf is assumed to be the usual alias for pyspark.sql.functions
from pyspark.sql import functions as sf

(
    df.withColumn(
        "Date",
        sf.date_trunc(
            "DAY",
            sf.from_unixtime(sf.col("MyEpochField")),
        ),
    )
    .write.format("delta")
    .mode("append")
    .partitionBy("Date")
    .save("...")
)
{code}
This gives a directory structure like the following:
{code:none}
/tip
/tip/Date=2021-05-04 00%3A00%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00/part-00000-8846eb80-a369-43f6-a715-fec9cf1adf95.c000.snappy.parquet
{code}
Notice that the ':' character appears (URL?) encoded as '%3A' in the folder names, presumably because a literal ':' would violate the filesystem's path rules.
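To make the encoding concrete, here is a minimal sketch using Python's standard library. It only mirrors the folder names shown in the listing; it is an assumption that plain percent-encoding is what happens here, and this is not the routine Spark itself uses.

```python
from urllib.parse import quote, unquote

# Partition value before it is turned into a directory name.
value = "2021-05-04 00:00:00"

# Percent-encode the value but keep the space, as seen in the listing.
# Illustration only: this is not Spark's own escaping routine.
encoded = quote(value, safe=" ")
print(encoded)  # 2021-05-04 00%3A00%3A00

# Decoding restores the original value.
decoded = unquote(encoded)
print(decoded)  # 2021-05-04 00:00:00
```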
When I try to open this dataset using delta-rs
([https://github.com/delta-io/delta-rs]), which uses Arrow under the hood,
an error is raised while parsing the Date (folder) value.
{code:none}
pyarrow.lib.ArrowInvalid: error parsing '2021-05-03 00%3A00%3A00' as scalar of type timestamp[ns]
{code}
It seems this error is raised along the call chain ScalarParseImpl => ParseValue =>
StringConverter<TimestampType>::Convert => ParseTimestampISO8601.
The parse function at the end of that chain supports the following formats:
{code:c++}
static inline bool ParseTimestampISO8601(const char* s, size_t length,
                                         TimeUnit::type unit,
                                         TimestampType::c_type* out) {
  using seconds_type = std::chrono::duration<TimestampType::c_type>;

  // We allow the following formats for all units:
  // - "YYYY-MM-DD"
  // - "YYYY-MM-DD[ T]hhZ?"
  // - "YYYY-MM-DD[ T]hh:mmZ?"
  // - "YYYY-MM-DD[ T]hh:mm:ssZ?"
  <...>
{code}
But the value does not seem to be (URL?) decoded before it is parsed, so '%3A' reaches the parser where ':' is expected.
Questions we have:
* Should Arrow support timestamp fields when they are used as partition fields?
* Where should the decoding happen?
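As a rough illustration of the decoding step asked about above, the sketch below percent-decodes a hive-style path segment before parsing it as a timestamp. The helper name `parse_partition_segment` is hypothetical and is not part of Arrow or delta-rs; it uses only the Python standard library and says nothing about where such a step belongs in Arrow itself.

```python
from datetime import datetime
from typing import Tuple
from urllib.parse import unquote


def parse_partition_segment(segment: str) -> Tuple[str, datetime]:
    """Split a 'key=value' path segment and parse the value as a timestamp.

    Hypothetical helper for illustration only; not an Arrow or delta-rs API.
    """
    key, _, raw_value = segment.partition("=")
    # Percent-decode first, so '%3A' becomes ':' before timestamp parsing.
    value = unquote(raw_value)
    return key, datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


key, ts = parse_partition_segment("Date=2021-05-03 00%3A00%3A00")
print(key, ts.isoformat(sep=" "))  # Date 2021-05-03 00:00:00
```

Without the `unquote` call, `strptime` would be handed the still-encoded string from the error message and would raise a `ValueError`, which corresponds to the failure reported above.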