[GitHub] [incubator-iceberg] rdblue commented on issue #118: Support plaintext Data (CSV, TSV, etc.) in Iceberg Tables

GitBox Mon, 04 Mar 2019 13:58:06 -0800

rdblue commented on issue #118: Support plaintext Data (CSV, TSV, etc.) in 
Iceberg Tables
URL: 
https://github.com/apache/incubator-iceberg/issues/118#issuecomment-469438013
 
 
   Some of the guarantees made by Iceberg can't be satisfied by plain-text 
formats like CSV and JSON. The intent was for Iceberg tables to work the same 
way no matter what format is used to store the data. Here are a few problems:
   
   * These formats can't support the required schema evolution rules. Delimited 
formats (CSV, etc.) have no schema, so column resolution must be done by 
position, which breaks when columns are deleted from the middle of a schema. 
Similarly, JSON column resolution must be done by name, which breaks when a 
column is deleted and a new one is added with the same name.
   * Delimited formats also lack framing and use delimiters to handle nesting. 
This strategy breaks after a few levels of nesting, so these formats can't be 
used for arbitrary schemas.
   * Neither CSV nor JSON is splittable without extra rules or escaping. Even 
when using those hacks to make them splittable, they can't be compressed and 
splittable at the same time.
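
   To make the schema evolution problem concrete, here is a minimal Python 
sketch of the two failure modes described in the first bullet. The column 
names, sample records, and helper functions (`read_by_position`, 
`read_by_name`) are made up for illustration and are not part of Iceberg.

```python
import csv
import io
import json

# Hypothetical CSV data file written while the schema was (id, name, email).
old_csv = "1,alice,alice@example.com\n"

# "name" is later dropped from the middle, so the current schema is (id, email).
current_schema = ["id", "email"]

def read_by_position(text, schema):
    """Resolve columns by position, the only option for a headerless file."""
    return [dict(zip(schema, row)) for row in csv.reader(io.StringIO(text))]

# The old file still has three fields, so "email" picks up the old "name" value.
csv_rows = read_by_position(old_csv, current_schema)

# Hypothetical JSON data file written while "status" was a string column.
old_json = '{"id": 1, "status": "active"}\n'

# That column is dropped, then an unrelated column is added with the same name.
# Old files should yield null for the new column, but name matching cannot
# tell the two columns apart.
json_schema = ["id", "status"]

def read_by_name(text, schema):
    """Resolve columns by name, as a JSON reader has to."""
    return [{col: rec.get(col) for col in schema}
            for rec in map(json.loads, text.splitlines())]

json_rows = read_by_name(old_json, json_schema)

print(csv_rows)   # "email" holds a value that was written as "name"
print(json_rows)  # the new "status" exposes data from the dropped column
```

   In both cases the reader returns values without any error, which is what 
makes the problem hard to notice.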
   
   Basically, these formats aren't suitable for tables that make guarantees 
about schema evolution or have features like splittability.
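
   The splittability point can be sketched the same way. The example below 
assumes a naive split reader that starts at an arbitrary offset and skips to 
the next newline, and it uses gzip only as one example of a stream codec. The 
sample data and the `read_split` helper are hypothetical.

```python
import csv
import gzip
import io
import zlib

# A CSV file where one quoted field contains a newline.
data = 'id,comment\n1,"first line\nsecond line"\n2,"plain"\n'

# Reading the whole file from the start parses the quoted newline correctly.
whole = list(csv.reader(io.StringIO(data)))

def read_split(text, offset):
    """Naive split reader: skip to the next newline after offset, then parse."""
    start = text.index("\n", offset) + 1
    return list(csv.reader(io.StringIO(text[start:])))

# Offset 15 falls inside the quoted field, so the reader resumes in the
# middle of a record and treats the tail of that field as its own row.
split = read_split(data, 15)

# Compressing the file makes things worse: a reader that starts in the
# middle of a gzip stream cannot decode anything at all.
compressed = gzip.compress(data.encode("utf-8"))
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    mid_stream_ok = True
except (OSError, EOFError, zlib.error):
    mid_stream_ok = False

print(whole[1])       # one record with an embedded newline
print(split[0])       # a fragment of that record, parsed as a separate row
print(mid_stream_ok)  # whether decoding from the middle succeeded
```

   Avoiding the first problem requires rules such as forbidding raw newlines 
inside fields, which are the "extra rules or escaping" mentioned above.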
   
   We could add a different mode to support some of these, but I don't see 
enough value in it. I think the right path is to use a Spark or Hive table for 
CSV data and load it into an Iceberg table for long-term storage to get 
reliability and performance.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

