I have a feature request or suggestion:
Spark 2.1 currently generates partitioned directory names like
"timestamp=2015-06-20 08%3A00%3A00"
I request and recommend that it use the "T" delimiter between the date and
time portions rather than a space character, e.g.,
"timestamp=2015-06-20T08%3A00%3A00".
Two reasons:
1) The official ISO-8601 formatting standard specifies a "T" delimiter. RFC
3339, which builds on ISO-8601, says that a space character is also
acceptable, but AFAIK that is not part of the official ISO-8601 spec.
2) URIs can't have spaces in them.
"s3://mybucket/data/timestamp=YYYY-MM-ddTHH%3Amm%3Ass" is a valid URI, while
the space-character variant is not. Spark already URI-escapes the colon
characters as "%3A"; it should likewise use the URI-compliant "T" character
rather than a space.
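To illustrate reason 2, this snippet (plain JVM, no Spark needed) checks both variants with java.net.URI; the bucket and path names are just placeholders:

```scala
import java.net.{URI, URISyntaxException}

// The "T"-delimited variant is a well-formed URI ("%3A" escapes the colons).
val tVariant = new URI("s3://mybucket/data/timestamp=2015-06-20T08%3A00%3A00")

// The space-delimited variant is rejected outright.
val spaceRejected =
  try {
    new URI("s3://mybucket/data/timestamp=2015-06-20 08%3A00%3A00")
    false
  } catch {
    case _: URISyntaxException => true
  }

println(tVariant.getRawPath) // /data/timestamp=2015-06-20T08%3A00%3A00
println(spaceRejected)       // true
```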
This also applies to reading existing data. If I load a DataFrame from
directories partitioned by timestamp using Spark's standard space delimiter
between date and time, Spark automatically recognizes the field as a
timestamp. If the directory names use the ISO-8601 standard "T" delimiter
instead, Spark does not recognize the field as a timestamp but rather as a
generic string.
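My guess (unverified against Spark's source) is that partition-value inference falls back on JDBC-style timestamp parsing, which only accepts the space delimiter. The asymmetry is easy to see in plain Scala, no Spark required:

```scala
import java.sql.Timestamp

// The JDBC timestamp escape format ("yyyy-mm-dd hh:mm:ss") parses fine...
val spaceParsed = Timestamp.valueOf("2015-06-20 08:00:00")

// ...but the same instant with the ISO-8601 "T" delimiter is rejected.
val tParsed =
  try {
    Timestamp.valueOf("2015-06-20T08:00:00")
    true
  } catch {
    case _: IllegalArgumentException => false
  }

println(spaceParsed) // 2015-06-20 08:00:00.0
println(tParsed)     // false
```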
Below is a short code snippet that can be pasted into spark-shell to
reproduce this issue:
```
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDateTime
val simpleSchema = StructType(
  StructField("id", IntegerType) ::
  StructField("name", StringType) ::
  StructField("value", StringType) ::
  StructField("timestamp", TimestampType) :: Nil)

val data = List(
  Row(1, "Alice", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(2, "Bob",   "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(3, "Bob",   "C102", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 9, 0))),
  Row(4, "Bob",   "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 21, 9, 0)))
)
val df = spark.createDataFrame(data.asJava, simpleSchema)
df.printSchema()
df.show()
df.write.partitionBy("timestamp").save("test/")
```
```
~ find test -type d
test
test/timestamp=2015-06-20 08%3A00%3A00
test/timestamp=2015-06-20 09%3A00%3A00
test/timestamp=2015-06-21 09%3A00%3A00
```
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-formatting-in-partitioned-directory-output-YYYY-MM-dd-HH-3Amm-3Ass-vs-YYYY-MM-ddTHH-3Amm-3-tp21404.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.