swapnilushinde commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496382800

Hello @HyukjinKwon @MaxGekk - The simple example above was just for illustration purposes. A DDL-format schema is not a good choice for applications where the CSV data contains many fields. However, case classes need to be defined for those applications anyway for type safety, so maintaining the schema in both a case class and a DDL string is not ideal.

What is the recommended way to load tab-delimited data with 10+ fields into a Spark Dataset? The only options I know of are to define a StructType for all those fields or to write a long DDL statement. Since a case class must be defined anyway for type safety, the schema for a single dataset ends up being defined twice in the same application. Wouldn't it be beneficial to have a simpler API that only needs the case class to load a CSV file? The proposed API gives a single way to define the schema, via the case class, and loads CSV without StructType or DDL definitions.

Parquet and JSON formats: it is already easy to load these formats with a schema, so there is no need for (or confusion from) an equivalent API like the one proposed.
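For reference, a minimal sketch of the workaround I mean, where the schema is derived from the case class via Spark's `Encoders` so it is defined only once. The `Person` case class and the `people.tsv` path are placeholders for illustration, not part of the actual PR:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class standing in for a real tab-delimited dataset
// with many fields; the schema lives only here.
case class Person(name: String, age: Int, city: String)

object CsvWithCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-case-class")
      .getOrCreate()
    import spark.implicits._

    // Derive the StructType from the case class encoder, avoiding a
    // separate hand-written StructType or DDL string.
    val schema = Encoders.product[Person].schema

    val ds = spark.read
      .option("sep", "\t")        // tab-delimited input
      .schema(schema)
      .csv("people.tsv")          // placeholder path
      .as[Person]                 // typed Dataset[Person]

    ds.show()
    spark.stop()
  }
}
```

This still requires the `Encoders.product[...].schema` boilerplate at every read site, which is what a dedicated API would remove.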
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
