ANSHUMAN created SPARK-21763:
--------------------------------
Summary: InferSchema option does not infer the correct schema
(timestamp) from xlsx file.
Key: SPARK-21763
URL: https://issues.apache.org/jira/browse/SPARK-21763
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Environment: Environment is my personal laptop.
Reporter: ANSHUMAN
Priority: Minor
I have an xlsx file containing a date/time field (My Time) in the following
format, with these sample records:
5/16/2017 12:19:00 AM
5/16/2017 12:56:00 AM
5/16/2017 1:17:00 PM
5/16/2017 5:26:00 PM
5/16/2017 6:26:00 PM
I am reading the xlsx file in the following manner:
{code:java}
val inputDF = spark.sqlContext.read.format("com.crealytics.spark.excel")
.option("location","file:///C:/Users/file.xlsx")
.option("useHeader","true")
.option("treatEmptyValuesAsNulls","true")
.option("inferSchema","true")
.option("addColorColumns","false")
.load()
{code}
When I try to get schema using
{code:java}
inputDF.printSchema()
{code}
I get *Double* as the column type.
Sometimes I even get *String* instead.
When I print the data, I get the following output:
+------------------+
|           My Time|
+------------------+
|42871.014189814814|
| 42871.03973379629|
|42871.553773148145|
| 42871.72765046296|
| 42871.76887731482|
+------------------+
The above output is clearly not correct for the given input.
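The doubles above appear to be Excel serial date numbers (days since 1899-12-30 in the 1900 date system, with the time of day as the fractional part) rather than arbitrary values. Below is a minimal sketch of how such a number could be decoded with plain java.time. It assumes the workbook uses the 1900 date system and that all serials fall after February 1900. The names ExcelSerialDemo and excelSerialToDateTime are hypothetical and not part of Spark or spark-excel.

```scala
import java.time.LocalDateTime

object ExcelSerialDemo {
  // Assumption: 1900 date system, where serial 0 maps to 1899-12-30
  // for any date after February 1900.
  private val ExcelEpoch: LocalDateTime = LocalDateTime.of(1899, 12, 30, 0, 0)

  private val SecondsPerDay: Double = 86400.0

  // Hypothetical helper: converts an Excel serial number to a LocalDateTime,
  // rounding to the nearest whole second.
  def excelSerialToDateTime(serial: Double): LocalDateTime = {
    val seconds = math.round(serial * SecondsPerDay)
    ExcelEpoch.plusSeconds(seconds)
  }

  def main(args: Array[String]): Unit = {
    // Values copied from the printed DataFrame in this report.
    val serials = Seq(
      42871.014189814814,
      42871.03973379629,
      42871.553773148145,
      42871.72765046296,
      42871.76887731482
    )
    serials.foreach { s =>
      println(s"$s -> ${excelSerialToDateTime(s)}")
    }
  }
}
```

Under those assumptions, 42871.014189814814 decodes to a time on 2017-05-16, the date of the sample records. This suggests that the column is being read as the raw numeric cell value instead of being interpreted as a timestamp.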
Moreover, if I convert the xlsx file to CSV format and read it, I get the
correct output. Here is how I read the CSV file:
{code:java}
spark.sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", true)
.load(fileLocation)
{code}
Please look into the issue. I could not find the answer to it anywhere.