[ https://issues.apache.org/jira/browse/PARQUET-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238879#comment-17238879 ]

Dongjoon Hyun commented on PARQUET-1946:
----------------------------------------

Hi, [~richiesgr]. Does this occur only with Parquet 1.11.0? Did you try 
Parquet 1.11.1?

> Parquet File not readable by Google big query (works with Spark)
> ----------------------------------------------------------------
>
>                 Key: PARQUET-1946
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1946
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.11.0
>         Environment: [secor|https://github.com/pinterest/secor]
> GCP 
> Big Query google cloud
> Parquet writer 1.11
>  
>  
>            Reporter: Richard Grossman
>            Priority: Blocker
>
> Hi
> I'm trying to write Avro messages as Parquet files on GCS. These Parquet 
> files should be queryable by the BigQuery engine, which now supports Parquet.
> To do this I'm using Secor, a Kafka log persistence tool from Pinterest.
> At first I didn't notice any problem: Spark can read the same file without 
> any issue; everything works perfectly.
> But querying it with BigQuery brings up an error like this:
> Error while reading table: , error message: Read less values than expected: 
> Actual: 29333, Expected: 33827. Row group: 0, Column: , File:
> After investigating with parquet-tools, I figured out that the Parquet 
> metadata records the total number of unique values for each column, e.g. 
> from parquet-tools:
> page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547
> So the VC value indicates that the total number of unique values in the 
> file is 547.
> But when I run a Spark SQL query like SELECT COUNT(DISTINCT column) FROM ... 
> I get 421, meaning this number in the metadata is incorrect.
> So what is not a problem for Spark to read is a blocking problem for 
> BigQuery, because it relies on these values and finds them incorrect.
> Is there any writer configuration that can prevent these errors in the 
> metadata? Or is this normal behavior that shouldn't be a problem?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
