[
https://issues.apache.org/jira/browse/PARQUET-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue resolved PARQUET-363.
-------------------------------
Resolution: Fixed
Assignee: Ryan Blue
Fix Version/s: 1.9.0
> Cannot construct empty MessageType for ReadContext.requestedSchema
> ------------------------------------------------------------------
>
> Key: PARQUET-363
> URL: https://issues.apache.org/jira/browse/PARQUET-363
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.8.0, 1.8.1
> Reporter: Cheng Lian
> Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> In parquet-mr 1.8.1, constructing an empty {{GroupType}} (and thus an empty
> {{MessageType}}) is no longer allowed (see PARQUET-278). This change makes
> sense in most cases, since Parquet doesn't support empty groups. However,
> there is one use case where an empty {{MessageType}} is valid, namely passing
> an empty {{MessageType}} as the {{requestedSchema}} constructor argument of
> {{ReadContext}} when counting rows in a Parquet file. The reason this works
> is that Parquet can retrieve the row count from block metadata without
> materializing any columns. Take the following PySpark shell snippet
> ([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209],
> which uses parquet-mr 1.7.0) as an example:
> {noformat}
> >>> path = 'file:///tmp/foo'
> >>> # Writes 10 integers into a Parquet file
> >>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
> >>> sqlContext.read.parquet(path).count()
> 10
> {noformat}
> Parquet related log lines:
> {noformat}
> 15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields from the Parquet file:
> Parquet form:
> message root {
> }
> Catalyst form:
> StructType()
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 10
> {noformat}
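> Under 1.8.0/1.8.1, the same query fails before any data is read, because the
> empty requested schema can no longer be constructed. A minimal repro sketch
> (my own; the check and {{InvalidSchemaException}} come from PARQUET-278):
> {noformat}
> import org.apache.parquet.schema.MessageType;
>
> public class EmptySchemaRepro {
>   public static void main(String[] args) {
>     // Fine in parquet-mr 1.7.0; under 1.8.0/1.8.1 the PARQUET-278 check
>     // rejects the zero-field schema with InvalidSchemaException.
>     MessageType empty = new MessageType("root");
>     System.out.println(empty); // prints the empty schema under 1.7.0
>   }
> }
> {noformat}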
> We can see from the log lines that Spark SQL passes no requested columns to
> the underlying Parquet reader. What happens here is that:
> # Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus
> only generates empty rows).
> # {{InternalParquetRecordReader}} first obtains the row count from block
> metadata
> ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]).
> # {{MessageColumnIO}} returns an {{EmptyRecordReader}} for reading the
> Parquet file
> ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
> # {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where
> _n_ equals the row count. Each time, it invokes the converter created by
> Spark SQL and produces an empty Spark SQL row object (see the sketch after
> this list).
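> To make that concrete, a count-only {{ReadSupport}} only needs to request
> zero columns. The following is a rough sketch of the idea
> ({{CountingReadSupport}} is a hypothetical name, not Spark's actual
> {{CatalystReadSupport}}), and its {{init}} is exactly what breaks once empty
> {{MessageType}} construction is rejected:
> {noformat}
> import java.util.Map;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.parquet.hadoop.api.InitContext;
> import org.apache.parquet.hadoop.api.ReadSupport;
> import org.apache.parquet.io.api.RecordMaterializer;
> import org.apache.parquet.schema.MessageType;
>
> // Hypothetical ReadSupport for count-only scans: it requests zero columns,
> // so the reader can answer nextKeyValue() from block metadata alone.
> public class CountingReadSupport extends ReadSupport<Object> {
>   @Override
>   public ReadContext init(InitContext context) {
>     // Throws InvalidSchemaException under 1.8.0/1.8.1; valid again once
>     // empty MessageType construction is re-allowed.
>     return new ReadContext(new MessageType("root"));
>   }
>
>   @Override
>   public RecordMaterializer<Object> prepareForRead(
>       Configuration configuration, Map<String, String> keyValueMetaData,
>       MessageType fileSchema, ReadContext readContext) {
>     // A real implementation would return a materializer that produces one
>     // placeholder record per row, analogous to Spark's CatalystRowConverter
>     // with zero converters.
>     throw new UnsupportedOperationException("sketch only");
>   }
> }
> {noformat}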
> This issue is also the cause of HIVE-11611: when upgrading to Parquet 1.8.1,
> Hive worked around it by using {{tableSchema}} as the {{requestedSchema}}
> when no columns are requested
> ([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]).
> IMO this introduces a performance regression for cases like counting, because
> all columns now have to be materialized just to count rows.
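> Roughly, the workaround amounts to the following fallback (a sketch of the
> logic, not the actual Hive code):
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.parquet.schema.MessageType;
> import org.apache.parquet.schema.Type;
>
> // Sketch of the Hive workaround: with no requested columns, fall back to
> // the full table schema, so every column gets materialized even for a
> // plain count.
> public class WorkaroundSketch {
>   static MessageType requestedSchema(MessageType tableSchema,
>                                      List<String> requestedColumns) {
>     if (requestedColumns.isEmpty()) {
>       return tableSchema; // workaround: read all columns
>     }
>     List<Type> fields = new ArrayList<Type>();
>     for (String column : requestedColumns) {
>       fields.add(tableSchema.getType(column));
>     }
>     return new MessageType(tableSchema.getName(), fields);
>   }
> }
> {noformat}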