I have a feature request or suggestion:
Spark 2.1 currently generates partitioned directory names like
"timestamp=2015-06-20 08%3A00%3A00"
I request and recommend that it use the "T" delimiter between the date and
time portions rather than a space character, e.g.,
"timestamp=2015-06-20T08%3A00%3A00".
Two reasons:
1) The official ISO-8601 formatting standard specifies a "T" delimiter. RFC
3339, which builds on ISO-8601, says that a space character is also
acceptable, but AFAIK that is not part of the official ISO-8601 spec.
2) URIs can't have spaces in them.
"s3://mybucket/data/timestamp=YYYY-MM-ddTHH%3Amm%3Ass" is a valid URI, while
the space-character variant is not. Spark already URI-escapes the colon
characters as "%3A"; it should likewise use the URI-compliant "T" character
rather than a space.
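To illustrate reason 2, this snippet (plain JVM, no Spark needed) checks both variants with java.net.URI; the bucket and path names are just placeholders:

```scala
import java.net.{URI, URISyntaxException}

// The "T"-delimited variant is a well-formed URI ("%3A" escapes the colons).
val tVariant = new URI("s3://mybucket/data/timestamp=2015-06-20T08%3A00%3A00")

// The space-delimited variant is rejected outright.
val spaceRejected =
  try {
    new URI("s3://mybucket/data/timestamp=2015-06-20 08%3A00%3A00")
    false
  } catch {
    case _: URISyntaxException => true
  }

println(tVariant.getRawPath) // /data/timestamp=2015-06-20T08%3A00%3A00
println(spaceRejected)       // true
```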
This also applies to reading existing data. If I load a DataFrame from
directories partitioned by timestamp using Spark's standard space delimiter
between date and time, Spark automatically recognizes the field as a
timestamp. If the directory names use the ISO-8601 standard "T" delimiter
instead, Spark does not recognize the field as a timestamp but rather as a
generic string.
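My guess (unverified against Spark's source) is that partition-value inference falls back on JDBC-style timestamp parsing, which only accepts the space delimiter. The asymmetry is easy to see in plain Scala, no Spark required:

```scala
import java.sql.Timestamp

// The JDBC timestamp escape format ("yyyy-mm-dd hh:mm:ss") parses fine...
val spaceParsed = Timestamp.valueOf("2015-06-20 08:00:00")

// ...but the same instant with the ISO-8601 "T" delimiter is rejected.
val tParsed =
  try {
    Timestamp.valueOf("2015-06-20T08:00:00")
    true
  } catch {
    case _: IllegalArgumentException => false
  }

println(spaceParsed) // 2015-06-20 08:00:00.0
println(tParsed)     // false
```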
Below is a short code snippet that can be pasted into spark-shell to
reproduce this issue:
```
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDateTime
val simpleSchema = StructType(
  StructField("id", IntegerType) ::
  StructField("name", StringType) ::
  StructField("value", StringType) ::
  StructField("timestamp", TimestampType) :: Nil)

val data = List(
  Row(1, "Alice", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(2, "Bob",   "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(3, "Bob",   "C102", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 9, 0))),
  Row(4, "Bob",   "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 21, 9, 0)))
)
val df = spark.createDataFrame(data.asJava, simpleSchema)
df.printSchema()
df.show()
df.write.partitionBy("timestamp").save("test/")
```
```
~ find test -type d
test
test/timestamp=2015-06-20 08%3A00%3A00
test/timestamp=2015-06-20 09%3A00%3A00
test/timestamp=2015-06-21 09%3A00%3A00
```
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-formatting-in-partitioned-directory-output-YYYY-MM-dd-HH-3Amm-3Ass-vs-YYYY-MM-ddTHH-3Amm-3-tp21404.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.