Re: spark infers date to be timestamp type

2016-10-27 Thread Steve Loughran
CSV type inference isn't really ideal: it does a full scan of the file to
work out the column types, so you double the amount of data you need to read.
Unless you are just exploring files in a notebook, I'd recommend running
inference once, taking the schema it produces, and using that as the basis for
the code where you actually define the schema. That's when you can explicitly
declare the column types if the inferred ones aren't great.
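
A minimal sketch of that workflow, assuming a throwaway local SparkSession and
the two-row test.csv from the original question. The app name and value names
are illustrative and not from the thread; in spark-shell the `spark` value
already exists and the builder call can be dropped.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StructField, StructType}

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-schema-sketch")
  .getOrCreate()

// Recreate the sample file from the question so the sketch is self-contained.
Files.write(Paths.get("test.csv"),
  "date\n2015-01-01\n2016-03-05\n".getBytes(StandardCharsets.UTF_8))

// Step 1 (exploration, done once): let Spark scan the file and report
// the schema it infers.
val inferred = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("test.csv")
  .schema
println(inferred.treeString)

// Step 2 (real code): declare the schema by hand, correcting any inferred
// type that isn't what you want. With an explicit schema there is no
// inference pass over the data.
val schema = StructType(Seq(StructField("date", DateType, nullable = true)))

val df = spark.read
  .option("header", true)
  .schema(schema)
  .csv("test.csv")
df.printSchema()
```

Only step 2 needs to live in production code; step 1 is something you would
run by hand and then discard.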

(maybe I should write something which prints out the scala/py code for that 
declaration rather than having to do it by hand...)
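
A rough sketch of what such a generator could look like. `schemaToScala` is a
hypothetical helper, not something Spark provides, and it assumes a flat schema
of primitive column types (which is what CSV inference produces); it relies on
each DataType's toString giving its Scala name, e.g. "DateType".

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark): render a flat schema as Scala
// source text. Nested structs, arrays and maps would need extra handling
// because their toString is not valid Scala.
def schemaToScala(schema: StructType): String =
  schema.fields
    .map(f => s"""  StructField("${f.name}", ${f.dataType}, nullable = ${f.nullable})""")
    .mkString("StructType(Seq(\n", ",\n", "\n))")

// Stand-in for a schema obtained from a one-off inference run.
val example = StructType(Seq(
  StructField("date", TimestampType, nullable = true),
  StructField("amount", DoubleType, nullable = true)))

// Prints a declaration you can paste into your code and then edit by hand.
println(schemaToScala(example))
```

The printed text still has to be reviewed: the point is to paste it in and then
change any type the inference got wrong.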

On 27 Oct 2016, at 05:55, Hyukjin Kwon wrote:

Hi Koert,


Sorry, I thought you meant this was a regression between 2.0.0 and 2.0.1. I just
checked: inferring DateType has never been supported [1].

Yes, currently such data can only be inferred as timestamps.


[1]https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92




2016-10-27 9:12 GMT+09:00 Anand Viswanathan:
Hi,

you can supply a custom schema (with DateType for the column) and specify
dateFormat in .option().
or
on the Spark DataFrame side, you can convert the timestamp to a date by
applying a cast to the column.

Thanks and regards,
Anand Viswanathan

On Oct 26, 2016, at 8:07 PM, Koert Kuipers wrote:

hey,
i create a file called test.csv with contents:
date
2015-01-01
2016-03-05

next i run this code in spark 2.0.1:
spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .printSchema

the result is:
root
 |-- date: timestamp (nullable = true)


On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon wrote:

There is now a timestampFormat option for TimestampType and a dateFormat
option for DateType.

Would you mind sharing your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote:
is there a reason a column with dates in format yyyy-mm-dd in a csv file is
inferred to be TimestampType and not DateType?

thanks! koert






Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
Hi Koert,


Sorry, I thought you meant this was a regression between 2.0.0 and 2.0.1. I
just checked: inferring DateType has never been supported [1].

Yes, currently such data can only be inferred as timestamps.


[1]
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92




2016-10-27 9:12 GMT+09:00 Anand Viswanathan:

> Hi,
>
> you can supply a custom schema (with DateType for the column) and specify
> dateFormat in .option().
> or
> on the Spark DataFrame side, you can convert the timestamp to a date by
> applying a cast to the column.
>
> Thanks and regards,
> Anand Viswanathan
>
> On Oct 26, 2016, at 8:07 PM, Koert Kuipers  wrote:
>
> hey,
> i create a file called test.csv with contents:
> date
> 2015-01-01
> 2016-03-05
>
> next i run this code in spark 2.0.1:
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("inferSchema", true)
>   .load("test.csv")
>   .printSchema
>
> the result is:
> root
>  |-- date: timestamp (nullable = true)
>
>
> On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon  wrote:
>
>> There is now a timestampFormat option for TimestampType and a dateFormat
>> option for DateType.
>>
>> Would you mind sharing your code?
>>
>> On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:
>>
>>> is there a reason a column with dates in format yyyy-mm-dd in a csv file
>>> is inferred to be TimestampType and not DateType?
>>>
>>> thanks! koert
>>>
>>
>
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Anand Viswanathan
Hi,

you can supply a custom schema (with DateType for the column) and specify
dateFormat in .option().
or
on the Spark DataFrame side, you can convert the timestamp to a date by
applying a cast to the column.
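
A minimal sketch of both options, assuming a local SparkSession and the
test.csv from the question. The app name and the value names `viaSchema` and
`viaCast` are illustrative only; in spark-shell the `spark` value already
exists.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, StructField, StructType}

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-date-sketch")
  .getOrCreate()

// Recreate the sample file from the question so the sketch is self-contained.
Files.write(Paths.get("test.csv"),
  "date\n2015-01-01\n2016-03-05\n".getBytes(StandardCharsets.UTF_8))

// Option 1: skip inference, declare the column as DateType and say how
// the dates are formatted.
val viaSchema = spark.read
  .option("header", true)
  .option("dateFormat", "yyyy-MM-dd")
  .schema(StructType(Seq(StructField("date", DateType, nullable = true))))
  .csv("test.csv")

// Option 2: keep inference, then cast the resulting column to a date.
val viaCast = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("test.csv")
  .withColumn("date", col("date").cast(DateType))

viaSchema.printSchema()
viaCast.printSchema()
```

Option 1 avoids the extra pass over the file that inference needs; option 2 is
convenient when most inferred types are fine and only one column needs fixing.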

Thanks and regards,
Anand Viswanathan

> On Oct 26, 2016, at 8:07 PM, Koert Kuipers  wrote:
> 
> hey,
> i create a file called test.csv with contents:
> date
> 2015-01-01
> 2016-03-05
> 
> next i run this code in spark 2.0.1:
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("inferSchema", true)
>   .load("test.csv")
>   .printSchema
> 
> the result is:
> root
>  |-- date: timestamp (nullable = true)
> 
> 
> On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon wrote:
> There is now a timestampFormat option for TimestampType and a dateFormat
> option for DateType.
> 
> Would you mind sharing your code?
> 
> 
> On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote:
> is there a reason a column with dates in format yyyy-mm-dd in a csv file is
> inferred to be TimestampType and not DateType?
> 
> thanks! koert
> 



Re: spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
hey,
i create a file called test.csv with contents:
date
2015-01-01
2016-03-05

next i run this code in spark 2.0.1:
spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .printSchema

the result is:
root
 |-- date: timestamp (nullable = true)


On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon  wrote:

> There is now a timestampFormat option for TimestampType and a dateFormat
> option for DateType.
>
> Would you mind sharing your code?
>
> On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:
>
>> is there a reason a column with dates in format yyyy-mm-dd in a csv file
>> is inferred to be TimestampType and not DateType?
>>
>> thanks! koert
>>
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There is now a timestampFormat option for TimestampType and a dateFormat
option for DateType.

Would you mind sharing your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:

> is there a reason a column with dates in format yyyy-mm-dd in a csv file
> is inferred to be TimestampType and not DateType?
>
> thanks! koert
>


spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
is there a reason a column with dates in format yyyy-mm-dd in a csv file is
inferred to be TimestampType and not DateType?

thanks! koert