[
https://issues.apache.org/jira/browse/KUDU-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-2454:
------------------------------
Component/s: spark
> Avro Import/Export does not round trip
> --------------------------------------
>
> Key: KUDU-2454
> URL: https://issues.apache.org/jira/browse/KUDU-2454
> Project: Kudu
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.5.0
> Reporter: Grant Henke
> Priority: Critical
>
> When exporting to Avro, columns of type Byte or Short are written as
> Integers because Avro has no Byte or Short type. When re-importing the data,
> the job fails because the column types do not match.
> Ideally spark-avro would solve this by safely casting the values back to the
> smaller type; Guava has utilities that make this straightforward (e.g.
> Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix
> this, or add special handling on the Kudu side to perform the safe
> down-conversion, as sketched below.
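> A minimal sketch of that Kudu-side handling, in Scala. The helper name
> narrowToKuduType is hypothetical, and the mapping onto org.apache.kudu.Type
> is an assumption for illustration:
> {code:scala}
> import com.google.common.primitives.{Shorts, SignedBytes}
> import org.apache.kudu.Type
>
> // Hypothetical helper: narrow the Int that Avro hands back to the
> // declared Kudu column type, failing loudly on overflow instead of
> // silently truncating.
> def narrowToKuduType(value: Int, kuduType: Type): Any = kuduType match {
>   case Type.INT8  => SignedBytes.checkedCast(value.toLong) // -128..127
>   case Type.INT16 => Shorts.checkedCast(value.toLong)      // -32768..32767
>   case _          => value
> }
> {code}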
> Another type issue when exporting is that Decimal values are written as
> Strings instead of as the decimal logical type. There are a few unmerged
> pull requests to fix that (see the sketch after this list):
> * [https://github.com/databricks/spark-avro/pull/276]
> * [https://github.com/databricks/spark-avro/pull/121]
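> For reference, this is roughly what the decimal logical-type encoding looks
> like with the plain Avro API (a sketch, not spark-avro's implementation;
> the precision and scale here are made-up column attributes):
> {code:scala}
> import org.apache.avro.{Conversions, LogicalTypes, Schema}
>
> // Attach the decimal logical type (precision/scale) to a bytes schema.
> val precision = 18 // assumed for illustration
> val scale = 2
> val decimalSchema = LogicalTypes.decimal(precision, scale)
>   .addToSchema(Schema.create(Schema.Type.BYTES))
>
> // Encode a BigDecimal as the ByteBuffer an Avro record field expects,
> // instead of writing value.toString.
> val conversion = new Conversions.DecimalConversion()
> val value = new java.math.BigDecimal("1234.56").setScale(scale)
> val encoded = conversion.toBytes(value, decimalSchema, decimalSchema.getLogicalType)
> {code}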
> Additionally, Timestamp values are written as longs instead of the
> timestamp-micros logical type. This is a data corruption issue because the
> long [value that is
> output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103]
> is in milliseconds (Timestamp.getTime()), but the long value Kudu expects
> for a Timestamp column is in microseconds.
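> A sketch of the microsecond conversion Kudu would need (Scala; toMicros is
> a hypothetical helper name):
> {code:scala}
> import java.sql.Timestamp
>
> // Timestamp.getTime() is millisecond precision; Kudu's UNIXTIME_MICROS
> // wants microseconds, including the sub-millisecond part from getNanos().
> // floorDiv keeps pre-epoch (negative) timestamps correct.
> def toMicros(ts: Timestamp): Long =
>   Math.floorDiv(ts.getTime, 1000L) * 1000000L + ts.getNanos / 1000L
>
> // e.g. 1970-01-01T00:00:01.5Z -> 1500000L micros, not getTime()'s 1500L millis
> {code}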
> Given all these issues, ImportExportFiles needs much more test coverage
> before we suggest its use. Currently it only tests importing Strings from a
> CSV and does not test Avro or Parquet support.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)