Re: from_csv

2018-09-19 Thread John Zhuge
+1



-- 
John Zhuge


Re: from_csv

2018-09-19 Thread Ted Yu
+1


Re: from_csv

2018-09-19 Thread Dongjin Lee
Another +1.

I already experienced this case several times.


-- 
Dongjin Lee

A hitchhiker in the mathematical world.

github: github.com/dongjinleekr
linkedin: kr.linkedin.com/in/dongjinleekr
slideshare: www.slideshare.net/dongjinleekr


Re: from_csv

2018-09-16 Thread Hyukjin Kwon
+1 for this idea since text parsing in CSV/JSON is quite common.

One thing to consider is schema inference, as with the JSON functionality.
In the case of JSON, we added schema_of_json for this, and the same approach
should apply to CSV too.
If we see more need for it, we can consider a function like
schema_of_csv as well.
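
As a minimal sketch of the parallel being drawn here: schema_of_json is the
existing JSON helper referenced above, while schema_of_csv is only the
hypothetical CSV analogue under discussion, not a shipped API.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.schema_of_json

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Infer a schema, rendered in DDL form, from a sample JSON string.
spark.range(1)
  .select(schema_of_json("""{"time": "26/08/2015", "id": 1}"""))
  .show(false)  // prints something like struct<id:bigint,time:string>

// A CSV analogue could follow the same shape (hypothetical):
//   schema_of_csv("26/08/2015,1")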




Re: from_csv

2018-09-16 Thread Maxim Gekk
Hi Reynold,

> i'd make this as consistent as to_json / from_json as possible

Sure, the new function from_csv() has the same signature as from_json().

> how would this work in sql? i.e. how would passing options in work?

The options are passed to the function via a map, for example:
select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat',
'dd/MM/yyyy'))
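
As a minimal Scala sketch of the DataFrame-side call, assuming the signature
mirrors from_json (column, schema, options map) as stated above; the exact
API shape was still being settled in the PR:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types.{StructType, TimestampType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A CSV payload column alongside an ordinary column.
val df = Seq(("id-1", "26/08/2015")).toDF("id", "raw")
val schema = new StructType().add("time", TimestampType)

// Parse the CSV string in place; the other columns stay untouched.
df.select(
    col("id"),
    from_csv(col("raw"), schema,
      Map("timestampFormat" -> "dd/MM/yyyy")).as("parsed"))
  .select(col("id"), col("parsed.time"))
  .show()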



Re: from_csv

2018-09-15 Thread Reynold Xin
makes sense - i'd make this as consistent as to_json / from_json as
possible.

how would this work in sql? i.e. how would passing options in work?

--
excuse the brevity and lower case due to wrist injury


On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk wrote:

> Hi All,
>
> I would like to propose a new function, from_csv(), for parsing columns
> containing strings in CSV format. Here is my PR:
> https://github.com/apache/spark/pull/22379
>
> A use case is loading a dataset from external storage, a DBMS, or systems
> like Kafka, where CSV content was dumped as one of the columns/fields.
> Other columns could contain related information like timestamps, ids,
> data sources, etc. The column with CSV strings can be parsed by the
> existing csv() method of DataFrameReader, but in that case we have to
> "clean up" the dataset and remove the other columns, since the csv()
> method requires a Dataset[String]. Joining the result of parsing back to
> the original dataset by position is expensive and inconvenient. Instead,
> users parse CSV columns with string functions. That approach is usually
> error prone, especially for quoted values and other special cases.
>
> The methods proposed in the PR should improve the user experience when
> parsing CSV-like columns. Please share your thoughts.
>
> --
>
> Maxim Gekk
>
> Technical Solutions Lead
>
> Databricks Inc.
>
> maxim.g...@databricks.com
>
> databricks.com
>
>   
>
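
For context, a rough Scala sketch of the status-quo workaround described in
the proposal: csv() accepts only a Dataset[String], so the payload column
must be parsed on its own (column names and data here are illustrative).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A dataset where one column carries raw CSV payloads.
val events = Seq(
  (1L, "kafka", "26/08/2015,42"),
  (2L, "kafka", "27/08/2015,43")
).toDF("id", "source", "payload")

// 1. Strip the dataset down to the CSV strings alone,
//    since DataFrameReader.csv() requires a Dataset[String].
val csvOnly = events.select($"payload").as[String]

// 2. Parse them with the existing reader API.
val parsed = spark.read
  .option("timestampFormat", "dd/MM/yyyy")
  .schema("time TIMESTAMP, value INT")
  .csv(csvOnly)

// 3. Joining `parsed` back to `events` now needs a positional join
//    (e.g. zipWithIndex over the underlying RDDs): the expensive,
//    fragile step that from_csv() would eliminate.
parsed.show()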