GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/14388
[SPARK-16362][SQL] Support ArrayType and StructType in vectorized Parquet reader
## What changes were proposed in this pull request?
The vectorized Parquet reader currently does not support complex types such as
ArrayType, MapType and StructType. We should support them to extend the coverage
of the performance improvements the vectorized Parquet reader brings. This
patch adds ArrayType and StructType support first.
### Main changes
* Obtain repetition and definition level information for the Parquet schema

  In order to support complex types in the vectorized Parquet reader, we need
the repetition and definition level information for the Parquet schema, which
is used to encode the structure of complex types. This PR introduces a class to
capture this encoding: `RepetitionDefinitionInfo`. It also introduces a few
classes to capture the Parquet schema structure: `ParquetField`, `ParquetStruct`,
`ParquetArray` and `ParquetMap`. A new method `getParquetStruct` is added to
`ParquetSchemaConverter`; it creates a `ParquetStruct` object that captures the
structure and metadata. The `ParquetStruct` has the same schema structure as the
required schema used to guide Parquet reading, and it provides the corresponding
repetition and definition levels for the fields in the required schema.
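  To illustrate the encoding this item refers to, here is a minimal sketch
(the class and field names are illustrative analogues, not the PR's actual
code) of how Parquet's Dremel-style repetition and definition levels describe
a nullable list column:

  ```java
  // Sketch only: how repetition/definition levels encode nesting for a leaf
  // column such as `optional group a (LIST) { repeated group list
  // { optional int32 element } }`. RepDefInfo is a hypothetical analogue of
  // the PR's RepetitionDefinitionInfo.
  public class RepDefLevelDemo {
      static final class RepDefInfo {
          final int maxRepetitionLevel;  // repeated fields enclosing the leaf
          final int maxDefinitionLevel;  // optional/repeated fields enclosing it
          RepDefInfo(int maxRep, int maxDef) {
              this.maxRepetitionLevel = maxRep;
              this.maxDefinitionLevel = maxDef;
          }
      }

      public static void main(String[] args) {
          // For the schema above: one repeated ancestor (maxRep = 1) and three
          // optional/repeated ancestors (maxDef = 3).
          RepDefInfo info = new RepDefInfo(1, 3);
          // Per-value levels for the three records [[1, 2], null, []]:
          int[] rep = {0, 1, 0, 0};  // 0 starts a new record, 1 continues the list
          int[] def = {3, 3, 0, 1};  // 3 = value present, 0 = null record, 1 = empty list
          System.out.println("maxRep=" + info.maxRepetitionLevel
              + " maxDef=" + info.maxDefinitionLevel);
      }
  }
  ```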
* Attach `VectorizedColumnReader` to `ColumnVector`

  With a flat schema, each `ColumnVector` is a data column, so previously the
relation between `VectorizedColumnReader` and `ColumnVector` was one-to-one.
Now only a `ColumnVector` representing a data column has a corresponding
`VectorizedColumnReader`; when it is time to read a batch, a `ColumnVector`
with a complex type delegates to its child `ColumnVector`s.
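  The delegation described above can be sketched roughly as follows (the class
names and signatures here are hypothetical, not Spark's actual API): only leaf
vectors own a reader, and complex-typed vectors fan the batch read out to
their children.

  ```java
  import java.util.List;

  // Sketch: a vector either reads via its own column reader (leaf) or
  // delegates to child vectors (complex type).
  abstract class Vector {
      abstract int readBatch(int total);
  }

  // Leaf: backed by a reader that pulls `total` values from the Parquet data.
  class LeafVector extends Vector {
      interface ColumnReader { int read(int total); }
      private final ColumnReader reader;
      LeafVector(ColumnReader reader) { this.reader = reader; }
      @Override int readBatch(int total) { return reader.read(total); }
  }

  // Complex type: no reader of its own; forwards the call to its children.
  class StructVector extends Vector {
      private final List<Vector> children;
      StructVector(List<Vector> children) { this.children = children; }
      @Override int readBatch(int total) {
          int rows = total;
          for (Vector child : children) {
              rows = Math.min(rows, child.readBatch(total));
          }
          return rows;
      }
  }
  ```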
* Implement constructing complex records in `VectorizedColumnReader`

  `readBatch` in `VectorizedColumnReader` is the main method that reads data
into a `ColumnVector`. Previously it simply loaded the required number of
values according to the data type of the column vector. Now, after the data is
loaded into a column, we need to construct complex records in its parent
column, which can be an ArrayType, MapType or StructType. The information
needed to restore the data as complex types is encoded in Parquet's repetition
and definition levels. The new method `constructComplexRecords` in
`VectorizedColumnReader` implements the logic to restore the complex data:
it counts the consecutive values and adds an array into the parent column
when the repetition level indicates that a new record begins.
`constructComplexRecords` also has to handle null values, which can mean a
null record at the root level, an empty array, or an empty struct; the method
distinguishes these cases and sets the result accordingly.
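  The core idea of that reconstruction can be sketched as follows for the
array case (a simplified illustration, not the PR's actual
`constructComplexRecords`): repetition level 0 starts a new record, the
maximum definition level marks a present leaf value, and lower definition
levels distinguish a null record from an empty array.

  ```java
  import java.util.ArrayList;
  import java.util.List;

  // Sketch: rebuild array records from a flat value stream plus
  // repetition/definition levels (names and signature are illustrative).
  public class RecordAssembler {
      static List<List<Integer>> assemble(int[] values, int[] rep, int[] def,
                                          int maxDef) {
          List<List<Integer>> records = new ArrayList<>();
          List<Integer> current = null;
          int vi = 0;  // index into the value stream (nulls carry no value)
          for (int i = 0; i < rep.length; i++) {
              if (rep[i] == 0) {        // repetition level 0: new record begins
                  if (def[i] == 0) {    // fully undefined: a null record
                      records.add(null);
                      current = null;
                      continue;
                  }
                  current = new ArrayList<>();
                  records.add(current);
              }
              if (def[i] == maxDef) {   // leaf value actually present
                  current.add(values[vi++]);
              }
              // lower definition level: empty array, nothing to append
          }
          return records;
      }
  }
  ```

  For example, with values `{1, 2}`, repetition levels `{0, 1, 0, 0}` and
definition levels `{3, 3, 0, 1}` (maxDef = 3), this yields the records
`[[1, 2], null, []]`.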
### Benchmark
```scala
val N = 10000
withParquetTable((0 until N).map { i =>
  ((i to i + 1000).toList,
   (i to i + 100).map(_.toString).toList,
   (i to i + 1000).map(_.toDouble / 2).toList,
   ((0 to 10).map(_.toString).toList, (0 to 10).map(_.toString).toList))
}, "t") {
  val benchmark = new Benchmark("Vectorization Parquet for nested types", N)
  benchmark.addCase("Vectorization Parquet reader", 10) { iter =>
    sql("SELECT _1[10], _2[20], _3[30], _4._1[5], _4._2[5] FROM t").collect()
  }
  benchmark.run()
}
```
Disabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                  1706 / 2207        0.0     170580.8      1.0X
```
Enabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                   789 /  972        0.0      78919.4      1.0X
```
## How was this patch tested?
Jenkins tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 vectorized-parquet-complex-type
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14388.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14388
----
commit 8cfeb7e74843d8674c5354a67a7fc4f9d45100dd
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-07-27T09:32:18Z
Add ArrayType, StructType support to vectorized Parquet reader.
----