Cheng Lian created PARQUET-363:
----------------------------------

             Summary: Cannot construct empty MessageType for 
ReadContext.requestedSchema
                 Key: PARQUET-363
                 URL: https://issues.apache.org/jira/browse/PARQUET-363
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.0, 1.8.1
            Reporter: Cheng Lian


In parquet-mr 1.8.1, constructing an empty {{GroupType}} (and thus an empty {{MessageType}}) is no longer allowed (see PARQUET-278). This change makes sense in most cases, since Parquet doesn't support empty groups. However, there is one use case where an empty {{MessageType}} is valid: passing an empty {{MessageType}} as the {{requestedSchema}} constructor argument of {{ReadContext}} when counting rows in a Parquet file. This works because Parquet can retrieve the row count from block metadata without materializing any columns. Take the following PySpark shell snippet ([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209], which uses parquet-mr 1.7.0) as an example:
{noformat}
>>> path = 'file:///tmp/foo'
>>> # Writes 10 integers into a Parquet file
>>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
>>> sqlContext.read.parquet(path).count()

10
{noformat}
Parquet-related log lines:
{noformat}
15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields 
from the Parquet file:

Parquet form:
message root {
}


Catalyst form:
StructType()

15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized 
will read a total of 10 records.
15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 
ms. row count = 10
{noformat}
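
For reference, the empty schema shown in the log above is exactly the construction that PARQUET-278 now rejects. A minimal Java sketch of the failing call under 1.8.1 (class name is made up for illustration; the exact exception raised may differ):
{noformat}
import org.apache.parquet.schema.MessageType;

public class EmptyMessageTypeDemo {
  public static void main(String[] args) {
    // Under parquet-mr 1.7.0 this worked and printed "message root {}";
    // since PARQUET-278 (1.8.0+) the GroupType constructor rejects an
    // empty field list, so this line now throws at construction time.
    MessageType empty = new MessageType("root");
    System.out.println(empty);
  }
}
{noformat}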
We can see that Spark SQL passes no requested columns to the underlying Parquet reader. What happens here is the following:

# Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus only produces empty rows).
# {{InternalParquetRecordReader}} first obtains the row count from block metadata ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]), as sketched in the code after this list.
# {{MessageColumnIO}} returns an {{EmptyRecordReader}} for reading the Parquet file ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
# {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where _n_ equals the row count. Each time, it invokes the converter created by Spark SQL and produces an empty Spark SQL row object.
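
A condensed sketch of the row-count path from step 2 (illustrative Java, not the verbatim 1.8.1 source; {{RowCountSketch}} and {{countRows}} are made-up names): the total comes purely from block metadata, so no column data needs to be read.
{noformat}
import java.util.List;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

public class RowCountSketch {
  // Sums per-block row counts the way InternalParquetRecordReader does
  // during initialization; no column chunks are touched.
  static long countRows(List<BlockMetaData> blocks) {
    long total = 0;
    for (BlockMetaData block : blocks) {
      total += block.getRowCount();
    }
    return total;
  }
}
{noformat}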

This issue is also the cause of HIVE-11611: when upgrading to Parquet 1.8.1, Hive worked around it by using {{tableSchema}} as the {{requestedSchema}} when no columns are requested ([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]). IMO this introduces a performance regression in cases like counting, because now all columns must be materialized just to count rows.
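
For comparison, a sketch of that workaround pattern (illustrative only; {{buildReadContext}} and {{pruneColumns}} are hypothetical names, not Hive's actual helpers):
{noformat}
import java.util.List;
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.MessageType;

public class ReadContextSketch {
  // When no columns are requested, the workaround hands the complete
  // file schema to ReadContext instead of an empty MessageType, which
  // forces every column to be decoded just to count rows.
  ReadContext buildReadContext(MessageType tableSchema, List<String> requestedColumns) {
    MessageType requested = requestedColumns.isEmpty()
        ? tableSchema                                  // workaround: read everything
        : pruneColumns(tableSchema, requestedColumns); // hypothetical pruning helper
    return new ReadContext(requested);
  }

  // Hypothetical stub for illustration; real column pruning lives in
  // the ReadSupport implementation.
  private MessageType pruneColumns(MessageType schema, List<String> columns) {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
{noformat}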



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
