Cheng Lian created PARQUET-363:
----------------------------------
Summary: Cannot construct empty MessageType for
ReadContext.requestedSchema
Key: PARQUET-363
URL: https://issues.apache.org/jira/browse/PARQUET-363
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.8.0, 1.8.1
Reporter: Cheng Lian
In parquet-mr 1.8.1, constructing an empty {{GroupType}} (and thus an empty
{{MessageType}}) is no longer allowed (see PARQUET-278). This change makes
sense in most cases, since Parquet doesn't support empty groups. However, there
is one use case where an empty {{MessageType}} is valid: passing it as the
{{requestedSchema}} constructor argument of {{ReadContext}} when counting rows
in a Parquet file. This works because Parquet can retrieve the row count from
block metadata without materializing any columns. Take the following PySpark
shell snippet
([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209],
which uses parquet-mr 1.7.0) as an example:
{noformat}
>>> path = 'file:///tmp/foo'
>>> # Writes 10 integers into a Parquet file
>>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
>>> sqlContext.read.parquet(path).count()
10
{noformat}
Parquet-related log lines:
{noformat}
15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields
from the Parquet file:
Parquet form:
message root {
}
Catalyst form:
StructType()
15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized
will read a total of 10 records.
15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0
ms. row count = 10
{noformat}
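The empty requested schema shown in the log above is exactly what can no longer
be constructed under parquet-mr 1.8.1. Here is a minimal Java sketch of that
pattern (the wrapper class is illustrative, not code from Spark or parquet-mr):
{noformat}
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.MessageType;

public class EmptyRequestedSchemaSketch {
  public static void main(String[] args) {
    // Works under parquet-mr 1.7.0; under 1.8.1 this constructor call throws
    // because empty groups are rejected (see PARQUET-278).
    MessageType emptyRequestedSchema = new MessageType("root");

    // An empty requested schema is all a count-only scan needs, since the
    // row count comes from block metadata and no columns are materialized.
    ReadContext readContext = new ReadContext(emptyRequestedSchema);
    System.out.println(readContext.getRequestedSchema());
  }
}
{noformat}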
We can see from the log lines above that Spark SQL passes no requested columns
to the underlying Parquet reader. What happens here is:
# Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus
only produces empty rows).
# {{InternalParquetRecordReader}} first obtains the row count from block
metadata
([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]);
see the sketch after this list.
# {{MessageColumnIO}} returns an {{EmptyRecordReader}} for reading the
Parquet file
([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
# {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where
_n_ equals the row count. Each time, it invokes the converter created by
Spark SQL and produces an empty Spark SQL row object.
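The metadata-only row count in step 2 can be illustrated with the following
sketch (the {{countRows}} helper is hypothetical; it only shows that the count
comes entirely from footer metadata, without touching any column data):
{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class RowCountSketch {
  // Hypothetical helper: sums per-block row counts from the file footer, the
  // same information InternalParquetRecordReader uses, without reading columns.
  static long countRows(Configuration conf, Path file) throws IOException {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    long total = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      total += block.getRowCount();
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countRows(new Configuration(), new Path(args[0])));
  }
}
{noformat}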
This issue is also the root cause of HIVE-11611: when upgrading to Parquet
1.8.1, Hive worked around it by using {{tableSchema}} as the
{{requestedSchema}} when no columns are requested
([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]).
IMO this introduces a performance regression in cases like counting, because
now all columns must be materialized just to count rows.
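Hive's fallback is roughly equivalent to the following sketch (the method and
variable names are illustrative, not Hive's actual identifiers); the
empty-columns branch is what forces every column to be materialized for a
simple count:
{noformat}
import java.util.List;
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class WorkaroundSketch {
  // Illustrative only: when no columns are requested, fall back to the full
  // table schema so that the requested schema is never empty. The cost is
  // that a count-only scan now materializes every column of every row.
  static ReadContext buildReadContext(MessageType tableSchema, List<Type> requestedColumns) {
    if (requestedColumns.isEmpty()) {
      return new ReadContext(tableSchema);
    }
    return new ReadContext(new MessageType(tableSchema.getName(), requestedColumns));
  }
}
{noformat}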