Richard Grossman created PARQUET-1946:
-----------------------------------------

             Summary: Parquet file not readable by Google BigQuery (works with Spark)
                 Key: PARQUET-1946
                 URL: https://issues.apache.org/jira/browse/PARQUET-1946
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.11.0
         Environment: [secor|https://github.com/pinterest/secor]

GCP

Google BigQuery

Parquet writer 1.11
            Reporter: Richard Grossman


Hi,

I'm trying to write Avro messages as Parquet files on GCS. These Parquet files should be queried by the BigQuery engine, which now supports Parquet.

To do this I'm using Secor, a Kafka log persistence tool from Pinterest.


At first I didn't notice any problem: using Spark, the same file can be read without any issue and everything works perfectly.


Now, querying it with BigQuery raises an error like this:
Error while reading table: , error message: Read less values than expected: Actual: 29333, Expected: 33827. Row group: 0, Column: , File:

After investigating with parquet-tools, I figured out that the Parquet file contains metadata about the total number of unique values for each column, e.g. from parquet-tools:
page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547

So the VC value indicates that the total number of unique values in the file is 547.
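To double-check what the footer itself records, here is a minimal sketch using the standard parquet-hadoop API (the file path argument is a placeholder): it prints the per-column value counts and statistics for each row group, i.e. the metadata that BigQuery appears to compare against the number of values it actually reads.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpValueCounts {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the Parquet file written by Secor (placeholder).
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            ParquetMetadata footer = reader.getFooter();
            int rowGroup = 0;
            for (BlockMetaData block : footer.getBlocks()) {
                System.out.println("Row group " + rowGroup++ + " rows=" + block.getRowCount());
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // getValueCount() is the number of values the writer claims
                    // to have written for this column chunk.
                    System.out.println("  " + column.getPath()
                        + " valueCount=" + column.getValueCount()
                        + " stats=" + column.getStatistics());
                }
            }
        }
    }
}
{code}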


Now, when I run a Spark SQL query like SELECT DISTINCT COUNT(column) FROM ... I get 421, which means this number in the metadata is incorrect.
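An equivalent check can be scripted against Spark as well; here is a minimal Java sketch (the file path and column name passed as arguments are placeholders) that prints both the non-null value count and the distinct count for a column:

{code:java}
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.countDistinct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CompareCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("compare-parquet-counts")
            .master("local[*]")
            .getOrCreate();

        // args[0]: path to the Parquet file, args[1]: column name (placeholders).
        Dataset<Row> df = spark.read().parquet(args[0]);
        df.select(
                count(col(args[1])).alias("non_null_values"),
                countDistinct(col(args[1])).alias("distinct_values"))
          .show();

        spark.stop();
    }
}
{code}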


So what is not a problem for Spark to read is a blocking problem for BigQuery, because it relies on these values and finds them incorrect.

Is there any configuration of the writer that can prevent these errors in the metadata? Or is this normal behavior, and should it be considered a problem?
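For context, this is the kind of writer configuration I have in mind; a minimal sketch with a hypothetical schema and output path (not the actual Secor code), showing the builder options that affect what gets written (writer version, dictionary encoding, page checksums):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteAvroToParquet {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema; Secor derives the real one from the Kafka topic.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"value\",\"type\":\"long\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(args[0]))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     // Builder options that change what ends up in the file
                     // and its metadata.
                     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)
                     .withDictionaryEncoding(true)
                     .withPageWriteChecksumEnabled(true)
                     .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "example");
            record.put("value", 42L);
            writer.write(record);
        }
    }
}
{code}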

Thanks


