[https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840018#comment-16840018]
Ruslan Dautkhanov edited comment on SPARK-15463 at 5/15/19 4:00 PM:
--------------------------------------------------------------------
[~hyukjin.kwon] would it be possible to make CSV parsing optional (and run only
schema inference)?
We have an RDD with the columns already stored separately in a tuple, but all
of them as strings.
It would be great to infer the schema without parsing a single string as CSV.
Our current workaround is to glue all the strings back together (with proper
quoting, escaping, etc.) just so that Spark can split them back into columns
here
[https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156]
and finally run inferSchema on the result.
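A minimal sketch of that glue step (illustrative only; in practice it would run per partition over the RDD, and the function name is made up here): each row of already-split string columns is collapsed back into one CSV line with RFC 4180-style quoting and escaping, so Spark's CSV reader can split the columns back out losslessly.

```python
import csv
import io

def glue_columns(rows):
    """Collapse rows of already-split string columns back into CSV lines.

    Quoting/escaping is delegated to the stdlib csv writer (QUOTE_MINIMAL:
    a field is quoted only if it contains the delimiter, the quote char,
    or a newline; embedded quotes are doubled). Illustrative sketch only.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue().splitlines()
```

The resulting lines could then be fed back through Spark's CSV reader with schema inference enabled (e.g. in PySpark, `spark.read.csv(..., inferSchema=True)`), which is exactly the round trip this comment is asking to avoid.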
infer() already accepts `RDD[Array[String]]`, but the current public API only
accepts `RDD[String]` (or `Dataset[String]`):
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]
I don't think infer() at
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30]
is a public API that can be called directly? If it were, that would be a
better workaround than collapsing all-string columns into CSV lines only for
Spark to parse them internally just to infer the data types of those columns.
Thank you for any leads.
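For illustration of what "infer directly on already-split columns" would look like, here is a toy version of per-column inference with a simplified int -> float -> string widening chain. This is an assumption-laden sketch, not Spark's actual logic: CSVInferSchema's real promotion rules also cover booleans, timestamps, decimals, and null handling.

```python
def infer_type(value):
    """Infer the narrowest type for one string value: int, float, or str."""
    for typ in (int, float):
        try:
            typ(value)
            return typ.__name__
        except ValueError:
            pass
    return "str"

def infer_columns(rows):
    """Infer one type per column from rows of already-split strings,
    widening int -> float -> str as conflicting values appear.

    Toy sketch: Spark's CSVInferSchema handles many more types and options.
    """
    rank = {"int": 0, "float": 1, "str": 2}
    widest = None
    for row in rows:
        types = [infer_type(v) for v in row]
        if widest is None:
            widest = types
        else:
            widest = [a if rank[a] >= rank[b] else b
                      for a, b in zip(widest, types)]
    return widest
```

Note that no CSV serialization or re-parsing happens anywhere in this path; the inference walks the tuple columns as-is, which is the behavior being requested.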
> Support for creating a dataframe from CSV in Dataset[String]
> ------------------------------------------------------------
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: PJ Fanning
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 2.2.0
>
>
> I currently use Databricks' spark-csv lib, but some features don't work with
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV
> support directly into spark-sql, spark-csv won't be modified further.
> I currently read some CSV data that has been pre-processed and is in
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't
> appear to support the creation of DataFrames based on loading from
> RDD[String].
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)