[ https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277405#comment-17277405 ]
Daniel Himmelstein commented on SPARK-26325:
--------------------------------------------

h1. Solution in pyspark 3.0.1

Turns out there is an {{inferTimestamp}} option that must be enabled. From the Spark [migration guide|https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-301]:

{quote}In Spark 3.0, JSON datasource and JSON function {{schema_of_json}} infer TimestampType from string values if they match to the pattern defined by the JSON option {{timestampFormat}}. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option {{inferTimestamp}} to {{true}} to enable such type inference.{quote}

I was surprised this change landed in a patch release and is not yet reflected in the [latest docs|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html]. It appears the inference caused a major performance regression, so it was disabled by default: see [apache/spark#28966|https://github.com/apache/spark/pull/28966], SPARK-26325, and SPARK-32130.

So in pyspark 3.0.1:

{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("inferTimestamp", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
{code}

Returns:

{code:java}
DataFrame[time_field: timestamp]
{code}

Yay!

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26325
>                 URL: https://issues.apache.org/jira/browse/SPARK-26325
>             Project: Spark
>          Issue Type: Bug
>      Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Veenit Shah
>            Priority: Major
>
> I am trying to read a pretty-printed JSON file that has time fields in it. I want to interpret the timestamp columns as timestamp fields while reading the JSON itself. However, {{printSchema}} still shows them as strings.
>
> E.g.
> Input json file -
> {code:java}
> [{
>     "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
>
> Code -
> {code:python}
> df = (
>     spark.read
>     .option("multiLine", "true")
>     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
>     .json('path_to_json_file')
> )
> {code}
>
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}
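As a quick sanity check outside of Spark, one can confirm that the sample value really does match the intended pattern. The sketch below is only an illustration and is not part of the issue: it parses the value with Python's {{datetime.strptime}}, and the format string {{"%Y-%m-%d %H:%M:%S.%fZ"}} is my own rough equivalent (an assumption) of the Spark pattern {{yyyy-MM-dd HH:mm:ss.SSSSSS'Z'}}, with {{%f}} consuming the six fractional-second digits.

```python
from datetime import datetime

# Sample value taken from the issue's JSON file.
value = "2017-09-30 04:53:39.412496Z"

# Rough Python equivalent of the Spark pattern "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'".
# %f matches the six fractional-second digits; the trailing Z is matched literally.
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%fZ")
print(parsed.isoformat())  # 2017-09-30T04:53:39.412496
```

If this parse fails, the JSON values do not match the {{timestampFormat}} pattern, and Spark would fall back to StringType regardless of the {{inferTimestamp}} setting.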