Maxim Gekk created SPARK-25393:
----------------------------------
Summary: Parsing CSV strings in a column
Key: SPARK-25393
URL: https://issues.apache.org/jira/browse/SPARK-25393
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk
There are use cases where content in CSV format is stored in an external
system as one of several columns. For example, CSV records are written to Kafka
together with other meta-info. The current Spark API does not allow parsing such
columns directly: the existing
[csv()|https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L487]
method requires a dataset with exactly one string column, which is inconvenient
when the CSV column is part of a dataset with many columns. This ticket aims to
add a new function, similar to
[from_json()|https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3456],
with the following signature in Scala:
{code:scala}
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
{code}
and, for use from Python, R, and Java:
{code:scala}
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)