GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/14388
[SPARK-16362][SQL] Support ArrayType and StructType in vectorized Parquet reader
## What changes were proposed in this pull request?
The vectorized Parquet reader currently does not support complex types such as
ArrayType, MapType and StructType. We should support them to extend the coverage
of the performance improvements the vectorized Parquet reader brings. This
patch adds ArrayType and StructType support first.
### Main changes
* Obtain repetition and definition level information for the Parquet schema

  In order to support complex types in the vectorized Parquet reader, we need
the repetition and definition level information for the Parquet schema, which
is used to encode the structure of complex types. This PR introduces a class to
capture this encoding: `RepetitionDefinitionInfo`. It also introduces a few
classes to capture the Parquet schema structure: `ParquetField`, `ParquetStruct`,
`ParquetArray` and `ParquetMap`. A new method `getParquetStruct` is added to
`ParquetSchemaConverter`; it creates a `ParquetStruct` object that captures the
structure and metadata. The `ParquetStruct` has the same schema structure as the
required schema used to guide Parquet reading, and it provides the corresponding
repetition and definition levels for the fields in the required schema.
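  To illustrate the encoding this item refers to, here is a minimal sketch
(the class and field names are illustrative analogues, not the PR's actual
code) of how Parquet's Dremel-style repetition and definition levels describe
a nullable list column:

  ```java
  // Sketch only: how repetition/definition levels encode nesting for a leaf
  // column such as `optional group a (LIST) { repeated group list
  // { optional int32 element } }`. RepDefInfo is a hypothetical analogue of
  // the PR's RepetitionDefinitionInfo.
  public class RepDefLevelDemo {
      static final class RepDefInfo {
          final int maxRepetitionLevel;  // repeated fields enclosing the leaf
          final int maxDefinitionLevel;  // optional/repeated fields enclosing it
          RepDefInfo(int maxRep, int maxDef) {
              this.maxRepetitionLevel = maxRep;
              this.maxDefinitionLevel = maxDef;
          }
      }

      public static void main(String[] args) {
          // For the schema above: one repeated ancestor (maxRep = 1) and three
          // optional/repeated ancestors (maxDef = 3).
          RepDefInfo info = new RepDefInfo(1, 3);
          // Per-value levels for the three records [[1, 2], null, []]:
          int[] rep = {0, 1, 0, 0};  // 0 starts a new record, 1 continues the list
          int[] def = {3, 3, 0, 1};  // 3 = value present, 0 = null record, 1 = empty list
          System.out.println("maxRep=" + info.maxRepetitionLevel
              + " maxDef=" + info.maxDefinitionLevel);
      }
  }
  ```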
* Attach `VectorizedColumnReader` to `ColumnVector`

  With a flat schema, each `ColumnVector` is a data column, so previously the
relation between `VectorizedColumnReader` and `ColumnVector` was one-to-one.
Now only a `ColumnVector` representing a data column has a corresponding
`VectorizedColumnReader`; when it is time to read a batch, a `ColumnVector`
with a complex type delegates to its child `ColumnVector`s.
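  The delegation described above can be sketched roughly as follows (the class
names and signatures here are hypothetical, not Spark's actual API): only leaf
vectors own a reader, and complex-typed vectors fan the batch read out to
their children.

  ```java
  import java.util.List;

  // Sketch: a vector either reads via its own column reader (leaf) or
  // delegates to child vectors (complex type).
  abstract class Vector {
      abstract int readBatch(int total);
  }

  // Leaf: backed by a reader that pulls `total` values from the Parquet data.
  class LeafVector extends Vector {
      interface ColumnReader { int read(int total); }
      private final ColumnReader reader;
      LeafVector(ColumnReader reader) { this.reader = reader; }
      @Override int readBatch(int total) { return reader.read(total); }
  }

  // Complex type: no reader of its own; forwards the call to its children.
  class StructVector extends Vector {
      private final List<Vector> children;
      StructVector(List<Vector> children) { this.children = children; }
      @Override int readBatch(int total) {
          int rows = total;
          for (Vector child : children) {
              rows = Math.min(rows, child.readBatch(total));
          }
          return rows;
      }
  }
  ```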
* Implement constructing complex records in `VectorizedColumnReader`

  `readBatch` in `VectorizedColumnReader` is the main method that reads data
into a `ColumnVector`. Previously it simply loaded the required number of
values according to the data type of the column vector. Now, after the data is
loaded into a column, we need to construct complex records in its parent
column, which can be an ArrayType, MapType or StructType. The information
needed to restore the data as complex types is encoded in Parquet's repetition
and definition levels. The new method `constructComplexRecords` in
`VectorizedColumnReader` implements the logic to restore the complex data:
it counts the consecutive values and adds an array into the parent column
when the repetition level indicates that a new record begins.
`constructComplexRecords` also has to handle null values, which can mean a
null record at the root level, an empty array, or an empty struct; the method
distinguishes these cases and sets the result accordingly.
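  The core idea of that reconstruction can be sketched as follows for the
array case (a simplified illustration, not the PR's actual
`constructComplexRecords`): repetition level 0 starts a new record, the
maximum definition level marks a present leaf value, and lower definition
levels distinguish a null record from an empty array.

  ```java
  import java.util.ArrayList;
  import java.util.List;

  // Sketch: rebuild array records from a flat value stream plus
  // repetition/definition levels (names and signature are illustrative).
  public class RecordAssembler {
      static List<List<Integer>> assemble(int[] values, int[] rep, int[] def,
                                          int maxDef) {
          List<List<Integer>> records = new ArrayList<>();
          List<Integer> current = null;
          int vi = 0;  // index into the value stream (nulls carry no value)
          for (int i = 0; i < rep.length; i++) {
              if (rep[i] == 0) {        // repetition level 0: new record begins
                  if (def[i] == 0) {    // fully undefined: a null record
                      records.add(null);
                      current = null;
                      continue;
                  }
                  current = new ArrayList<>();
                  records.add(current);
              }
              if (def[i] == maxDef) {   // leaf value actually present
                  current.add(values[vi++]);
              }
              // lower definition level: empty array, nothing to append
          }
          return records;
      }
  }
  ```

  For example, with values `{1, 2}`, repetition levels `{0, 1, 0, 0}` and
definition levels `{3, 3, 0, 1}` (maxDef = 3), this yields the records
`[[1, 2], null, []]`.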
### Benchmark
```scala
val N = 10000
withParquetTable((0 until N).map { i =>
  ((i to i + 1000).toList,
   (i to i + 100).map(_.toString).toList,
   (i to i + 1000).map(_.toDouble / 2).toList,
   ((0 to 10).map(_.toString).toList, (0 to 10).map(_.toString).toList))
}, "t") {
  val benchmark = new Benchmark("Vectorization Parquet for nested types", N)
  benchmark.addCase("Vectorization Parquet reader", 10) { iter =>
    sql("SELECT _1[10], _2[20], _3[30], _4._1[5], _4._2[5] FROM t").collect()
  }
  benchmark.run()
}
```
Disabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                  1706 / 2207        0.0     170580.8      1.0X
```
Enabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                   789 /  972        0.0      78919.4      1.0X
```
## How was this patch tested?
Jenkins tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 vectorized-parquet-complex-type
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14388.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14388
----
commit 8cfeb7e74843d8674c5354a67a7fc4f9d45100dd
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-07-27T09:32:18Z
Add ArrayType, StructType support to vectorized Parquet reader.
----