+1 for this idea since text parsing in CSV/JSON is quite common.

One thing to consider is schema inference, as with the JSON functionality.
For JSON, we added schema_of_json, and the same approach should be
applicable to CSV too.
If we see more need for it, we can consider a function like
schema_of_csv as well.
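
To make the schema inference idea a bit more concrete, here is a rough
sketch in plain Python of what inferring a schema from one CSV line could
look like. This is only an illustration and not Spark code: the names
infer_type and infer_csv_schema, the _c0/_c1 column names, the type names,
and the inference rules are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime


def infer_type(value, timestamp_format=None):
    """Guess a type name for one CSV field (illustrative rules only)."""
    if value == "":
        return "STRING"
    try:
        int(value)
        return "INT"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        pass
    if timestamp_format is not None:
        try:
            datetime.strptime(value, timestamp_format)
            return "TIMESTAMP"
        except ValueError:
            pass
    return "STRING"


def infer_csv_schema(line, timestamp_format=None):
    """Return a DDL-like schema string for a single CSV line (hypothetical helper)."""
    # csv.reader understands quoting, so a quoted "a,b" stays one field.
    fields = next(csv.reader(io.StringIO(line)), [])
    return ", ".join(
        f"_c{i} {infer_type(value, timestamp_format)}"
        for i, value in enumerate(fields)
    )


if __name__ == "__main__":
    print(infer_csv_schema('1,0.5,"a,b",26/08/2015', timestamp_format="%d/%m/%Y"))
```

Using a real CSV parser for the split, rather than string functions, is what
keeps the quoted field in one column here. An actual schema_of_csv would
presumably take the same options map as from_csv, but that is a guess.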


On Sun, Sep 16, 2018 at 4:41 PM, Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hi Reynold,
>
> > i'd make this as consistent as to_json / from_json as possible
>
> Sure, the new function from_csv() has the same signature as from_json().
>
> > how would this work in sql? i.e. how would passing options in work?
>
> The options are passed to the function via a map, for example:
> select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat',
> 'dd/MM/yyyy'))
>
> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
>
>> makes sense - i'd make this as consistent as to_json / from_json as
>> possible.
>>
>> how would this work in sql? i.e. how would passing options in work?
>>
>> --
>> excuse the brevity and lower case due to wrist injury
>>
>>
>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I would like to propose a new function, from_csv(), for parsing columns
>>> containing strings in CSV format. Here is my PR:
>>> https://github.com/apache/spark/pull/22379
>>>
>>> A use case is loading a dataset from external storage, a DBMS, or
>>> systems like Kafka, where CSV content was dumped as one of the
>>> columns/fields. Other columns could contain related information such as
>>> timestamps, ids, and data sources. The column with CSV strings can
>>> be parsed by the existing csv() method of DataFrameReader, but in that
>>> case we have to "clean up" the dataset and remove the other columns,
>>> since the csv() method requires Dataset[String]. Joining the parsing
>>> result back to the original dataset by position is expensive and
>>> inconvenient. Instead, users parse CSV columns with string functions, an
>>> approach that is usually error-prone, especially for quoted values and
>>> other special cases.
>>>
>>> The methods proposed in the PR should provide a better user experience
>>> for parsing CSV-like columns. Please share your thoughts.
>>>
>>> --
>>>
>>> Maxim Gekk
>>>
>>> Technical Solutions Lead
>>>
>>> Databricks Inc.
>>>
>>> maxim.g...@databricks.com
>>>
>>> databricks.com
>>>
>>>   <http://databricks.com/>
>>>
>>
>
