[
https://issues.apache.org/jira/browse/ARROW-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177413#comment-17177413
]
Chao Sun commented on ARROW-9735:
---------------------------------
Thanks [~sergiimk] - let me try to reproduce the issue.
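For context while reproducing: the panic text quoted below comes from the reader's footer validation. Per the Parquet format, a file must end with a 4-byte little-endian metadata length followed by the 4-byte magic {{PAR1}}. A minimal sketch of such a check (an illustration of the format rule, not the parquet crate's actual code; {{check_footer}} is a hypothetical name):
{code}
// Sketch of Parquet footer validation. A well-formed file ends with
// [4-byte LE metadata length][4-byte magic "PAR1"]; anything else is
// typically surfaced as a "corrupt footer" error.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

fn check_footer(file_bytes: &[u8]) -> Result<u32, String> {
    if file_bytes.len() < FOOTER_SIZE {
        return Err("Invalid Parquet file. Size is smaller than footer".to_string());
    }
    let tail = &file_bytes[file_bytes.len() - FOOTER_SIZE..];
    // The last four bytes must be the magic "PAR1".
    if &tail[4..] != PARQUET_MAGIC {
        return Err("Invalid Parquet file. Corrupt footer".to_string());
    }
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    // The metadata must also fit inside the file, before the footer itself.
    if metadata_len as usize + FOOTER_SIZE > file_bytes.len() {
        return Err("Invalid Parquet file. Corrupt footer".to_string());
    }
    Ok(metadata_len)
}

fn main() {
    // A fabricated 16-byte "file": 8 bytes of metadata, length = 8, magic PAR1.
    let mut ok = vec![0u8; 8];
    ok.extend_from_slice(&8u32.to_le_bytes());
    ok.extend_from_slice(PARQUET_MAGIC);
    assert_eq!(check_footer(&ok), Ok(8));

    // Corrupting the magic yields the same error message the reporter sees.
    let mut bad = ok.clone();
    let n = bad.len();
    bad[n - 1] = b'2';
    assert!(check_footer(&bad).is_err());
}
{code}
So a first thing to check on the attached file is whether its last eight bytes match this layout, or whether the reader's length arithmetic disagrees with what parquet-mr wrote.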
> [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
> ----------------------------------------------------------------------------
>
> Key: ARROW-9735
> URL: https://issues.apache.org/jira/browse/ARROW-9735
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Sergii Mikhtoniuk
> Priority: Major
> Attachments: data.snappy.parquet
>
>
> I started using the Rust Parquet library for some basic reading of files
> produced by Spark and was very happy with the performance. However, when I try
> to read any Parquet file produced by my Flink app I get a panic:
> {code:java}
> General("Invalid Parquet file. Corrupt footer") {code}
> I'm attaching the sample file: [^data.snappy.parquet]
> Output of the 'parquet-meta' command (from the parquet-tools package):
> {code:java}
> creator: parquet-mr version 1.10.0 (build
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra: parquet.avro.schema =
> {"type":"record","name":"Row","fields":[{"name":"system_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"event_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"city","type":["string","null"]},{"name":"population_x10","type":["int","null"]}]}
>
> extra: writer.model.name = avro
>
> file schema: Row
> -------------------------------------------------------------------------------------------------------------------
> system_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> event_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> city: OPTIONAL BINARY L:STRING R:0 D:1
> population_x10: OPTIONAL INT32 R:0 D:1
>
> row group 1: RC:3 TS:291 OFFSET:4
> -------------------------------------------------------------------------------------------------------------------
> system_time: INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:3
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:36:12.413+0000,
> max: 2020-08-14T00:36:12.413+0000, num_nulls: 0]
> event_time: INT64 SNAPPY DO:0 FPO:98 SZ:94/90/0.96 VC:3
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:35:42.709+0000,
> max: 2020-08-14T00:35:42.709+0000, num_nulls: 0]
> city: BINARY SNAPPY DO:0 FPO:192 SZ:50/48/0.96 VC:3
> ENC:PLAIN,BIT_PACKED,RLE ST:[min: A, max: C, num_nulls: 0]
> population_x10: INT32 SNAPPY DO:0 FPO:242 SZ:65/63/0.97 VC:3
> ENC:PLAIN,BIT_PACKED,RLE ST:[min: 10000, max: 30000, num_nulls: 0] {code}
> The file is produced by an Apache Flink + Scala app using a fairly recent dependency:
> {code:java}
> "org.apache.parquet" % "parquet-avro" % "1.10.0" {code}
> The code that produces the file looks like this:
> {code:scala}
> val writer = AvroParquetWriter
>   .builder[GenericRecord](new Path(path))
>   .withSchema(avroSchema)
>   .withDataModel(model)
>   .withCompressionCodec(CompressionCodecName.SNAPPY)
>   .build()
>
> for (row <- rows) { writer.write(row) }
> writer.close() {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)