[ https://issues.apache.org/jira/browse/PARQUET-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238881#comment-17238881 ]
Dongjoon Hyun commented on PARQUET-1946:
----------------------------------------
BTW, Spark 3.0/2.4 use Parquet 1.10.1.
> Parquet File not readable by Google big query (works with Spark)
> ----------------------------------------------------------------
>
> Key: PARQUET-1946
> URL: https://issues.apache.org/jira/browse/PARQUET-1946
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.11.0
> Environment: [secor|https://github.com/pinterest/secor]
> GCP
> Big Query google cloud
> Parquet writer 1.11
>
>
> Reporter: Richard Grossman
> Priority: Blocker
>
> Hi
> I'm trying to write Avro messages to Parquet files on GCS. These Parquet files
> should be queried by the BigQuery engine, which now supports Parquet.
> To do this I'm using Secor, a Kafka log persistence tool from Pinterest.
> At first I didn't notice any problem: using Spark, the same file can be read
> without any issue and everything works perfectly.
> BigQuery, however, fails with an error like this:
> Error while reading table: , error message: Read less values than expected:
> Actual: 29333, Expected: 33827. Row group: 0, Column: , File:
> After investigating with parquet-tools, I found that the Parquet file contains
> metadata about the total number of unique values for each column, e.g. from
> parquet-tools:
> page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547
> So the VC value indicates that the total number of unique values in the file
> is 547.
> Now when I run a Spark SQL query like SELECT DISTINCT COUNT(column) FROM ... I
> get 421, which means this number in the metadata is incorrect.
> So what is not a problem for Spark to read is a blocking problem for BigQuery,
> because it relies on these values and finds them incorrect.
> Is there any writer configuration that can prevent these errors in the
> metadata? Or is this normal behavior that should not be a problem?
> Thanks
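
For reference, one plausible reading of the "Read less values than expected" error is that the reader compares a value count declared in the metadata against the number of values it actually decodes from the pages. As far as I know, VC in parquet-tools output is that per-page value count, not a count of distinct values. The sketch below only illustrates the comparison: ColumnChunk, check_chunk and the per-page split are hypothetical, not BigQuery or Parquet APIs, and only the two totals come from the error message quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for the metadata of one column chunk."""
    declared_values: int          # count recorded in the column-chunk metadata
    page_value_counts: List[int]  # values actually decoded from each page


def check_chunk(chunk: ColumnChunk) -> str:
    """Compare the declared count with the sum of the per-page counts."""
    actual = sum(chunk.page_value_counts)
    if actual < chunk.declared_values:
        return (f"Read less values than expected: "
                f"Actual: {actual}, Expected: {chunk.declared_values}")
    if actual > chunk.declared_values:
        return (f"Read more values than expected: "
                f"Actual: {actual}, Expected: {chunk.declared_values}")
    return "ok"


# The totals match the reported error; the split into two pages is invented.
broken = ColumnChunk(declared_values=33827, page_value_counts=[20000, 9333])
print(check_chunk(broken))
```

A reader that skips this comparison would accept the same file without complaint, which would be consistent with Spark reading it while BigQuery rejects it, though that is an assumption rather than something confirmed in this thread.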