rdblue commented on issue #118: Support plaintext Data (CSV, TSV, etc.) in Iceberg Tables
URL: https://github.com/apache/incubator-iceberg/issues/118#issuecomment-469438013

Some of the guarantees made by Iceberg can't be satisfied by plain-text formats like CSV and JSON. The intent was for Iceberg tables to work the same way no matter what format is used to store the data. Here are a few problems:

* These formats can't support the required schema evolution rules. Delimited formats (CSV, etc.) have no schema, so column resolution must be done by position, which breaks when columns are deleted from the middle of a schema. Similarly, JSON column resolution must be done by name, which breaks when a column is deleted and a new one is added with the same name.
* Delimited formats also lack framing and use delimiters to handle nesting. This strategy breaks after a few levels of nesting, so they can't be used for arbitrary schemas.
* Neither CSV nor JSON is splittable without extra rules or escaping. Even with those hacks to make them splittable, they can't be compressed and splittable at the same time.

Basically, these formats aren't suitable for tables that make guarantees about schema evolution or have features like splittability. We could add a different mode to support some of these, but I don't see enough value in it. I think the right path is to use a Spark or Hive table for CSV data and load it into an Iceberg table for long-term storage to get reliability and performance.
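The schema evolution problem in the first bullet can be illustrated with a minimal Python sketch. It is not Iceberg code: the helper functions, column names, and sample records below are hypothetical and exist only to show how positional and name-based resolution misread old data after a schema change.

```python
import csv
import io
import json

# Hypothetical headerless CSV file written under the original schema (id, name, age).
old_csv = "1,alice,30\n"

def read_csv_by_position(text, schema):
    """Resolve columns by position, the only option for a headerless delimited file."""
    return [dict(zip(schema, row)) for row in csv.reader(io.StringIO(text))]

# The middle column "name" is later dropped, so the table schema becomes (id, age).
csv_rows = read_csv_by_position(old_csv, ["id", "age"])
# Position 1 of the old file held "name", so "age" now receives a name value.
print(csv_rows)

# Hypothetical JSON file written when "score" was a string-valued column.
old_json = '{"id": 1, "score": "high"}\n'

def read_json_by_name(text, schema):
    """Resolve columns by name, as a JSON reader must."""
    records = (json.loads(line) for line in text.splitlines())
    return [{col: rec.get(col) for col in schema} for rec in records]

# "score" is dropped, then a new, unrelated column is added under the same name.
# The new column should be null for old files, but name lookup revives the old value.
json_rows = read_json_by_name(old_json, ["id", "score"])
print(json_rows)
```

In both cases the reader returns a value that belongs to a column that no longer exists, and the file itself carries nothing that would let a reader detect the mismatch.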
