[
https://issues.apache.org/jira/browse/ARROW-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177413#comment-17177413
]
Chao Sun commented on ARROW-9735:
---------------------------------
Thanks [~sergiimk] - let me try to reproduce the issue.
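For context while reproducing: the panic text quoted below comes from the reader's footer validation. Per the Parquet format, a file must end with a 4-byte little-endian metadata length followed by the 4-byte magic {{PAR1}}. A minimal sketch of such a check (an illustration of the format rule, not the parquet crate's actual code; {{check_footer}} is a hypothetical name):
{code}
// Sketch of Parquet footer validation. A well-formed file ends with
// [4-byte LE metadata length][4-byte magic "PAR1"]; anything else is
// typically surfaced as a "corrupt footer" error.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

fn check_footer(file_bytes: &[u8]) -> Result<u32, String> {
    if file_bytes.len() < FOOTER_SIZE {
        return Err("Invalid Parquet file. Size is smaller than footer".to_string());
    }
    let tail = &file_bytes[file_bytes.len() - FOOTER_SIZE..];
    // The last four bytes must be the magic "PAR1".
    if &tail[4..] != PARQUET_MAGIC {
        return Err("Invalid Parquet file. Corrupt footer".to_string());
    }
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    // The metadata must also fit inside the file, before the footer itself.
    if metadata_len as usize + FOOTER_SIZE > file_bytes.len() {
        return Err("Invalid Parquet file. Corrupt footer".to_string());
    }
    Ok(metadata_len)
}

fn main() {
    // A fabricated 16-byte "file": 8 bytes of metadata, length = 8, magic PAR1.
    let mut ok = vec![0u8; 8];
    ok.extend_from_slice(&8u32.to_le_bytes());
    ok.extend_from_slice(PARQUET_MAGIC);
    assert_eq!(check_footer(&ok), Ok(8));

    // Corrupting the magic yields the same error message the reporter sees.
    let mut bad = ok.clone();
    let n = bad.len();
    bad[n - 1] = b'2';
    assert!(check_footer(&bad).is_err());
}
{code}
So a first thing to check on the attached file is whether its last eight bytes match this layout, or whether the reader's length arithmetic disagrees with what parquet-mr wrote.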
> [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
> ----------------------------------------------------------------------------
>
> Key: ARROW-9735
> URL: https://issues.apache.org/jira/browse/ARROW-9735
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Sergii Mikhtoniuk
> Priority: Major
> Attachments: data.snappy.parquet
>
>
> I started using the Rust Parquet library for some basic reading of files
> produced by Spark and was very happy with the performance. However, when I try
> to read any Parquet file produced by my Flink app I get a panic:
> {code:java}
> General("Invalid Parquet file. Corrupt footer") {code}
> I'm attaching the sample file: [^data.snappy.parquet]
> Output of the 'parquet-meta' command (from the parquet-tools package):
> {code:java}
> creator: parquet-mr version 1.10.0 (build
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra: parquet.avro.schema =
> {"type":"record","name":"Row","fields":[{"name":"system_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"event_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"city","type":["string","null"]},{"name":"population_x10","type":["int","null"]}]}
>
> extra: writer.model.name = avro
>
> file schema: Row
> -------------------------------------------------------------------------------------------------------------------
> system_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> event_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> city: OPTIONAL BINARY L:STRING R:0 D:1
> population_x10: OPTIONAL INT32 R:0 D:1
>
> row group 1: RC:3 TS:291 OFFSET:4
> -------------------------------------------------------------------------------------------------------------------
> system_time: INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:3
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:36:12.413+0000,
> max: 2020-08-14T00:36:12.413+0000, num_nulls: 0]
> event_time: INT64 SNAPPY DO:0 FPO:98 SZ:94/90/0.96 VC:3
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:35:42.709+0000,
> max: 2020-08-14T00:35:42.709+0000, num_nulls: 0]
> city: BINARY SNAPPY DO:0 FPO:192 SZ:50/48/0.96 VC:3
> ENC:PLAIN,BIT_PACKED,RLE ST:[min: A, max: C, num_nulls: 0]
> population_x10: INT32 SNAPPY DO:0 FPO:242 SZ:65/63/0.97 VC:3
> ENC:PLAIN,BIT_PACKED,RLE ST:[min: 10000, max: 30000, num_nulls: 0] {code}
> The file is produced by an Apache Flink + Scala app using a fairly recent dependency:
> {code:java}
> "org.apache.parquet" % "parquet-avro" % "1.10.0" {code}
> The code that produces the file looks like this:
> {code:scala}
> val writer = AvroParquetWriter
>   .builder[GenericRecord](new Path(path))
>   .withSchema(avroSchema)
>   .withDataModel(model)
>   .withCompressionCodec(CompressionCodecName.SNAPPY)
>   .build()
>
> for (row <- rows) { writer.write(row) }
> writer.close() {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)