+1 for this idea, since text parsing of CSV/JSON is quite common. One thing to consider is schema inference, as with the JSON functionality. For JSON we added schema_of_json, and the same approach should apply to CSV too. If we see more need for it, we can consider a function like schema_of_csv as well.
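For reference, a minimal sketch (not from the thread) of how the existing schema_of_json function infers a schema from a sample value, and how a schema_of_csv counterpart might look. schema_of_csv is only a hypothetical illustration here, not an existing API:

    // Minimal sketch: schema_of_json as it exists today; schema_of_csv is hypothetical.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{lit, schema_of_json}

    val spark = SparkSession.builder().master("local[*]").appName("schema-of-sketch").getOrCreate()

    // schema_of_json infers a DDL-like schema string from a sample JSON value.
    spark.range(1)
      .select(schema_of_json(lit("""{"time": "26/08/2015", "id": 1}""")).as("schema"))
      .show(false)   // e.g. struct<id:bigint,time:string>

    // A schema_of_csv analogue could infer the schema of a sample CSV line, e.g.:
    //   select(schema_of_csv(lit("26/08/2015,1")).as("schema"))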
On Sun, Sep 16, 2018 at 4:41 PM Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hi Reynold,
>
> > i'd make this as consistent as to_json / from_json as possible
>
> Sure, the new function from_csv() has the same signature as from_json().
>
> > how would this work in sql? i.e. how would passing options in work?
>
> The options are passed to the function via a map, for example:
> select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
>
> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
>
>> makes sense - i'd make this as consistent as to_json / from_json as
>> possible.
>>
>> how would this work in sql? i.e. how would passing options in work?
>>
>> --
>> excuse the brevity and lower case due to wrist injury
>>
>>
>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I would like to propose a new function from_csv() for parsing columns
>>> containing strings in CSV format. Here is my PR:
>>> https://github.com/apache/spark/pull/22379
>>>
>>> A use case is loading a dataset from external storage, a DBMS, or a
>>> system like Kafka, where CSV content was dumped as one of the
>>> columns/fields. Other columns could contain related information such as
>>> timestamps, ids, sources of data, etc. The column with CSV strings can
>>> be parsed by the existing csv() method of DataFrameReader, but in that
>>> case we have to "clean up" the dataset and remove the other columns,
>>> since the csv() method requires a Dataset[String]. Joining the result of
>>> parsing back to the original dataset by position is expensive and not
>>> convenient. Instead, users parse CSV columns with string functions. That
>>> approach is usually error prone, especially for quoted values and other
>>> special cases.
>>>
>>> The methods proposed in the PR should make for a better user experience
>>> when parsing CSV-like columns. Please share your thoughts.
>>>
>>> --
>>>
>>> Maxim Gekk
>>>
>>> Technical Solutions Lead
>>>
>>> Databricks Inc.
>>>
>>> maxim.g...@databricks.com
>>>
>>> databricks.com
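A sketch of the use case described in the thread: a dataset where one column holds raw CSV text (e.g. dumped from Kafka) next to other columns, parsed in place with the proposed from_csv. Since from_csv is what the linked PR adds, this assumes that change (or a release containing it) is available; the table name and sample data are made up for illustration:

    // Each row carries metadata plus a CSV payload in the "value" column.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("from-csv-sketch").getOrCreate()
    import spark.implicits._

    Seq(
      ("topic-a", 1L, "26/08/2015,store-1"),
      ("topic-a", 2L, "27/08/2015,store-2")
    ).toDF("source", "offset", "value").createOrReplaceTempView("events")

    // Parse the CSV column while keeping the other columns, with options
    // passed as a map (the SQL form shown in the thread).
    spark.sql(
      """SELECT source, offset,
        |       from_csv(value, 'time TIMESTAMP, store STRING',
        |                map('timestampFormat', 'dd/MM/yyyy')) AS parsed
        |FROM events""".stripMargin
    ).show(false)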