CSV type inference isn't ideal: it does a full scan of the file to work out 
the column types, so you end up reading the data twice. Unless you are just 
exploring files in a notebook, I'd recommend running inference once, getting 
the schema from the result, and using that as the basis for the code where you 
actually declare the schema. That is also the point where you can explicitly 
override any inferred types that aren't great.

(maybe I should write something which prints out the scala/py code for that 
declaration rather than having to do it by hand...)
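
A rough sketch of that one-off workflow, in Scala against the Spark 2.x 
DataFrameReader API. inferOnce and toScalaSource are made-up names for 
illustration, not anything that ships with Spark. The generated text relies on 
the toString of the simple column types that CSV inference produces, so treat 
it as a starting point to edit by hand rather than a general code generator.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Run inference once (this is the extra full scan) and keep only the schema.
def inferOnce(spark: SparkSession, path: String): StructType =
  spark.read
    .format("csv")
    .option("header", true)
    .option("inferSchema", true)
    .load(path)
    .schema

// Hypothetical helper: render a flat schema as Scala source text.
// Assumes simple column types (StringType, IntegerType, TimestampType, ...),
// whose toString is also a valid Scala expression; nested types are not handled.
def toScalaSource(schema: StructType): String =
  schema.fields
    .map(f => s"""  StructField("${f.name}", ${f.dataType}, nullable = ${f.nullable})""")
    .mkString("StructType(Seq(\n", ",\n", "\n))")
```

In spark-shell, println(toScalaSource(inferOnce(spark, "test.csv"))) prints a 
declaration you can paste into the real job (together with an import of 
org.apache.spark.sql.types._) and then correct by hand, for example changing 
TimestampType to DateType.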

On 27 Oct 2016, at 05:55, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Hi Koert,


Sorry, I thought you meant this was a regression between 2.0.0 and 2.0.1. I 
just checked: inferring DateType has never been supported [1].

Yes, currently such data can only be inferred as timestamps.


[1]https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92




2016-10-27 9:12 GMT+09:00 Anand Viswanathan <anand_v...@ymail.com>:
Hi,

You can use a custom schema (with DateType) and specify dateFormat in 
.option(), or, on the Spark DataFrame side, you can convert the timestamp to a 
date by casting the column.
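
Both options could look roughly like this in Scala. This is a sketch under 
assumptions: the function names are invented for illustration, the column is 
assumed to be called "date" as in the test.csv quoted below, and "yyyy-MM-dd" 
is assumed to be the right pattern for that file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, StructField, StructType}

// Option 1: declare the schema up front, so no inference scan is needed.
def readDatesWithSchema(spark: SparkSession, path: String): DataFrame = {
  val customSchema = StructType(Seq(StructField("date", DateType, nullable = true)))
  spark.read
    .format("csv")
    .option("header", true)
    .option("dateFormat", "yyyy-MM-dd") // pattern used to parse DateType columns
    .schema(customSchema)
    .load(path)
}

// Option 2: let inference produce a timestamp column, then cast it to a date.
def readDatesWithCast(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("csv")
    .option("header", true)
    .option("inferSchema", true)
    .load(path)
    .withColumn("date", col("date").cast(DateType))
```

Calling printSchema on either result should report the column as date rather 
than timestamp. The first option also avoids the extra pass over the file that 
inferSchema needs.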

Thanks and regards,
Anand Viswanathan

On Oct 26, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com> wrote:

hey,
i create a file called test.csv with contents:
date
2015-01-01
2016-03-05

next i run this code in spark 2.0.1:
spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .printSchema

the result is:
root
 |-- date: timestamp (nullable = true)


On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

There are now two options: timestampFormat for TimestampType and dateFormat 
for DateType.

Would you mind sharing your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers" <ko...@tresata.com> wrote:
is there a reason a column with dates in format yyyy-mm-dd in a csv file is 
inferred to be TimestampType and not DateType?

thanks! koert



