[
https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605279#comment-16605279
]
ASF GitHub Bot commented on FLINK-7243:
---------------------------------------
lvhuyen commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input
format
URL: https://github.com/apache/flink/pull/6483#issuecomment-418967608
@HuangZhenQiu
Here is the schema of that parquet file, printed in Zeppelin.
> root
> |-- metrics_date: timestamp (nullable = true)
> |-- counter: long (nullable = true)
> |-- meter: double (nullable = true)
> |-- customer_id: string (nullable = true)
I also attach that sample file here:
https://github.com/lvhuyen/flink/blob/parquet_input_format(7243)/flink-formats/flink-parquet/src/test/resources/test.parquet
I debugged this in IntelliJ: that column is in fact stored as the primitive
type int96 (not int64). In parquet-mr's ColumnReaderImpl
(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java),
int96 is treated as a String (line 274). The conversion from a ByteArray
into a String at line 393 of Binary
(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java)
seems to be irreversible and leads to data loss. My data has metrics_date =
2018-09-01 15:02:55.0, which was read as the byte array [0, 118, -95, -103,
69, 49, 0, 0, -5, -126, 37, 0]. After line 393, I got a String of length 12
that has the same character at the 3rd, 4th, 9th, and 10th positions.
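To make the observation above concrete, here is a minimal sketch of how those 12 bytes could be decoded directly. It assumes the file uses the INT96 timestamp layout commonly written by Spark, Hive, and Impala (8 little-endian bytes of nanoseconds within the day, then 4 little-endian bytes of Julian day number), and it assumes the String conversion in question behaves like a plain UTF-8 decode. The class and method names are hypothetical and are not part of Flink or parquet-mr; time zones are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Arrays;

// Hypothetical helper for illustration only; not Flink or parquet-mr code.
class Int96TimestampDemo {
    // Julian day number of 1970-01-01.
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;

    // Decodes a 12-byte INT96 value, assuming the layout
    // [8 bytes nanos of day, little-endian][4 bytes Julian day, little-endian].
    public static LocalDateTime decode(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        LocalDate date = LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_EPOCH);
        return LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay));
    }

    // Returns true only if decoding the bytes as UTF-8 and encoding the
    // result again yields the original bytes.
    public static boolean survivesUtf8RoundTrip(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The byte array reported above for metrics_date.
        byte[] raw = {0, 118, -95, -103, 69, 49, 0, 0, -5, -126, 37, 0};
        System.out.println(decode(raw));                // 2018-09-01T15:02:55
        System.out.println(survivesUtf8RoundTrip(raw)); // false
    }
}
```

Under these assumptions the raw bytes decode to the expected timestamp, while the UTF-8 round trip fails: the bytes at the 3rd, 4th, 9th, and 10th positions are not valid UTF-8 on their own, so each is replaced by the same replacement character and the original values cannot be recovered from the String.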
> Add ParquetInputFormat
> ----------------------
>
> Key: FLINK-7243
> URL: https://issues.apache.org/jira/browse/FLINK-7243
> Project: Flink
> Issue Type: Sub-task
> Components: Table API & SQL
> Reporter: godfrey he
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
>
> Add a {{ParquetInputFormat}} to read data from an Apache Parquet file.