Sergii Mikhtoniuk created ARROW-9735:
----------------------------------------
Summary: [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
Key: ARROW-9735
URL: https://issues.apache.org/jira/browse/ARROW-9735
Project: Apache Arrow
Issue Type: Bug
Components: Rust, Rust - DataFusion
Reporter: Sergii Mikhtoniuk
Attachments: data.snappy.parquet

I started using the Rust parquet library for some basic reading of files produced by Spark and was very happy with the performance. However, when I try to read any Parquet file produced by my Flink app, I get a panic:
{code:java}
General("Invalid Parquet file. Corrupt footer")
{code}

I'm attaching a sample file: [^data.snappy.parquet]

Output of the 'parquet-meta' command (from the parquet-tools package):
{code:java}
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       parquet.avro.schema = {"type":"record","name":"Row","fields":[{"name":"system_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"event_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"city","type":["string","null"]},{"name":"population_x10","type":["int","null"]}]}
extra:       writer.model.name = avro

file schema: Row
-------------------------------------------------------------------------------------------------------------------
system_time:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
event_time:     OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
city:           OPTIONAL BINARY L:STRING R:0 D:1
population_x10: OPTIONAL INT32 R:0 D:1

row group 1:    RC:3 TS:291 OFFSET:4
-------------------------------------------------------------------------------------------------------------------
system_time:     INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:36:12.413+0000, max: 2020-08-14T00:36:12.413+0000, num_nulls: 0]
event_time:      INT64 SNAPPY DO:0 FPO:98 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:35:42.709+0000, max: 2020-08-14T00:35:42.709+0000, num_nulls: 0]
city:            BINARY SNAPPY DO:0 FPO:192 SZ:50/48/0.96 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: A, max: C, num_nulls: 0]
population_x10:  INT32 SNAPPY DO:0 FPO:242 SZ:65/63/0.97 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 10000, max: 30000, num_nulls: 0]
{code}

The file is produced in Apache Flink + Scala using the (fairly recent) dependency:
{code:java}
"org.apache.parquet" % "parquet-avro" % "1.10.0"
{code}

The code that produces the file looks like this:
{code:java}
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path(path))
  .withSchema(avroSchema)
  .withDataModel(model)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

for (row <- rows) {
  writer.write(row)
}

writer.close()
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)