[GitHub] [flink] lvhuyen commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format

GitHub Thu, 06 Sep 2018 01:33:49 -0700

@HuangZhenQiu 
Here is the schema of that parquet file, printed in Zeppelin.
> root
>  |-- metrics_date: timestamp (nullable = true)
>  |-- counter: long (nullable = true)
>  |-- meter: double (nullable = true)
>  |-- customer_id: string (nullable = true)
I also attach that sample file here: 
https://github.com/lvhuyen/flink/blob/parquet_input_format(7243)/flink-formats/flink-parquet/src/test/resources/test.parquet


I debugged this in IntelliJ: that column is in fact stored as the primitive type 
INT96 (not INT64), and according to Apache's 
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java,
INT96 is treated as a Binary (line 274). In your current implementation of the 
RowConverter class, a Binary is converted into a String using UTF-8, which seems 
to be irreversible and leads to data loss. My data has metrics_date = 
2018-09-01 15:02:55.0, which was read as a byte array of [0, 118, -95, -103, 
69, 49, 0, 0, -5, -126, 37, 0] and then converted to a string of length 12 
that has the same character at the 3rd, 4th, 9th, and 10th positions. 
Would it be possible to modify the method 
RowConverter.RowPrimitiveConverter.addBinary() to handle String / BigInteger 
differently?
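
To illustrate the point, here is a minimal standalone sketch of how those 12 bytes could be interpreted instead of being decoded as UTF-8. It assumes the INT96 timestamp layout commonly written by Spark, Hive and Impala: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. The class and method names are hypothetical and are not part of Flink or parquet-mr, and no time zone adjustment is applied.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not Flink or parquet-mr code).
final class Int96TimestampSketch {

    // Julian day number of 1970-01-01.
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;

    // Assumed layout: bytes 0-7 = nanoseconds of day, bytes 8-11 = Julian day,
    // both little-endian.
    static LocalDateTime decodeInt96(byte[] raw) {
        if (raw.length != 12) {
            throw new IllegalArgumentException("INT96 must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        LocalDate date = LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_EPOCH);
        return LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay));
    }

    // Returns true only if bytes -> UTF-8 String -> bytes gives back the input.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The byte array observed in the debugger for metrics_date.
        byte[] raw = {0, 118, -95, -103, 69, 49, 0, 0, -5, -126, 37, 0};
        System.out.println(decodeInt96(raw));           // 2018-09-01T15:02:55
        System.out.println(survivesUtf8RoundTrip(raw)); // false
    }
}
```

The bytes -95, -103, -5 and -126 are not valid UTF-8 on their own, so Java replaces each of them with U+FFFD, which would explain the identical characters and why the original value cannot be recovered from the String.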

[ Full content available at: https://github.com/apache/flink/pull/6483 ]
This message was relayed via gitbox.apache.org for [email protected]
