swapnilushinde commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496382800

Hello @HyukjinKwon @MaxGekk - The simple example above was just for illustration purposes. A DDL-format schema is not a good choice for applications where the CSV data contains many fields. However, case classes need to be defined for those applications anyway for type safety, so maintaining the schema in both a case class and a DDL string is not ideal.

What is the recommended way to load tab-delimited data with 10+ fields into a Spark Dataset? The only options I know of are to define a StructType for all those fields or to write a long DDL statement. Since a case class must be defined anyway for type safety, the schema for a single dataset ends up being defined twice in the same application. Wouldn't it be beneficial to have a simpler API that only needs the case class to load a CSV file? The proposed API gives a single way to define the schema, via the case class, and loads CSV without StructType or DDL definitions.

Parquet and JSON formats: it is already easy to load these formats with a schema, so there is no need for (or confusion from) an equivalent API like the one proposed.
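For reference, a minimal sketch of the workaround I mean, where the schema is derived from the case class via Spark's `Encoders` so it is defined only once. The `Person` case class and the `people.tsv` path are placeholders for illustration, not part of the actual PR:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class standing in for a real tab-delimited dataset
// with many fields; the schema lives only here.
case class Person(name: String, age: Int, city: String)

object CsvWithCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-case-class")
      .getOrCreate()
    import spark.implicits._

    // Derive the StructType from the case class encoder, avoiding a
    // separate hand-written StructType or DDL string.
    val schema = Encoders.product[Person].schema

    val ds = spark.read
      .option("sep", "\t")        // tab-delimited input
      .schema(schema)
      .csv("people.tsv")          // placeholder path
      .as[Person]                 // typed Dataset[Person]

    ds.show()
    spark.stop()
  }
}
```

This still requires the `Encoders.product[...].schema` boilerplate at every read site, which is what a dedicated API would remove.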
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
