izchen opened a new pull request #29600:
URL: https://github.com/apache/spark/pull/29600


   ### What changes were proposed in this pull request?
   
   This PR adds logic for converting the original data types stored in Parquet files to Spark's internal data types.
   
   The conversion relationship is as follows:
   
   | PrimitiveTypeName    | OriginalType        | Boolean | Float | Double | Byte (int8) | Short (int16) | Integer (int32) | Date (int32) | Decimal (long or bigDecimal) | Long (int64) | Timestamp (int64) | String (byteArray) | Binary (byteArray) |
   | -------------------- | ------------------- | ------- | ----- | ------ | ----------- | ------------- | --------------- | ------------ | ---------------------------- | ------------ | ----------------- | ------------------ | ------------------ |
   | BOOLEAN              | /                   | 1       |       |        |             |               |                 |              |                              |              |                   | 2                  |                    |
   | FLOAT                | /                   |         | 1     | 2      | 3           | 3             | 3               |              | 3                            | 3            |                   | 2                  |                    |
   | DOUBLE               | /                   |         | 3     | 1      | 3           | 3             | 3               |              | 3                            | 3            |                   | 2                  |                    |
   | INT32                | null (INT32)        |         | 3     | 2      | 2           | 2             | 1               |              | 2                            | 2            |                   | 2                  |                    |
   | INT32                | INT_8               |         | 2     | 2      | 1           | 2             | 2               |              | 2                            | 2            |                   | 2                  |                    |
   | INT32                | INT_16              |         | 2     | 2      | 2           | 1             | 2               |              | 2                            | 2            |                   | 2                  |                    |
   | INT32                | INT_32              |         | 3     | 2      | 2           | 2             | 1               |              | 2                            | 2            |                   | 2                  |                    |
   | INT32                | DATE                |         |       |        |             |               |                 | 1            |                              |              | 2                 | 2                  |                    |
   | INT32                | DECIMAL             |         | 3     | 3      | 2,3         | 2,3           | 2,3             |              | 1,2,3                        | 2,3          |                   | 2                  |                    |
   | INT64                | null (INT64)        |         | 3     | 3      | 2           | 2             | 2               |              | 2                            | 1            |                   | 2                  |                    |
   | INT64                | INT_64              |         | 3     | 3      | 2           | 2             | 2               |              | 2                            | 1            |                   | 2                  |                    |
   | INT64                | DECIMAL             |         | 3     | 3      | 2,3         | 2,3           | 2,3             |              | 1,2,3                        | 2,3          |                   | 2                  |                    |
   | INT64                | TIMESTAMP_MILLIS    |         |       |        |             |               |                 | 3            |                              |              | 1                 | 2                  |                    |
   | INT64                | TIMESTAMP_MICROS    |         |       |        |             |               |                 | 3            |                              |              | 1                 | 2                  |                    |
   | INT96                | /                   |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | INT96                | / (assumeTimestamp) |         |       |        |             |               |                 | 3            |                              |              | 1                 | 2                  |                    |
   | BINARY               | null                |         |       |        |             |               |                 |              |                              |              |                   |                    | 1                  |
   | BINARY               | null (assumeString) |         | 3     | 3      | 2           | 2             | 2               | 2            | 3                            | 2            | 2                 | 1                  | 2                  |
   | BINARY               | UTF8                |         | 3     | 3      | 2           | 2             | 2               | 2            | 3                            | 2            | 2                 | 1                  | 2                  |
   | BINARY               | ENUM                |         | 3     | 3      | 2           | 2             | 2               | 2            | 3                            | 2            | 2                 | 1                  | 2                  |
   | BINARY               | JSON                |         |       |        |             |               |                 |              |                              |              |                   | 1                  | 2                  |
   | BINARY               | BSON                |         |       |        |             |               |                 |              |                              |              |                   |                    | 1                  |
   | BINARY               | DECIMAL             |         | 3     | 3      | 2,3         | 2,3           | 2,3             |              | 1,2,3                        | 2,3          |                   | 2                  |                    |
   | FIXED_LEN_BYTE_ARRAY | DECIMAL             |         | 3     | 3      | 2,3         | 2,3           | 2,3             |              | 1,2,3                        | 2,3          |                   | 2                  |                    |
   | /                    | UINT_8              |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | UINT_64             |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | UINT_32             |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | UINT_16             |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | TIME_MILLIS         |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | TIME_MICROS         |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   | /                    | INTERVAL            |         |       |        |             |               |                 |              |                              |              |                   |                    |                    |
   
   Mappings marked 1, 2, or 3 in the table can be converted:

   1 -> The types correspond exactly.

   2 -> The types do not correspond, but the value can be converted without side effects.

   3 -> The types do not correspond, and precision may be lost during conversion.

   For decimal types, the label that applies is determined by the decimal's scale and precision attributes.
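
   To make the table easier to reason about, here is a minimal Scala sketch of how a few of its rows could be held as a lookup structure. The object name `ConversionTableSketch`, the string keys, and the `labels` helper are all hypothetical and are not taken from the PR; only the label values come from the table. Blank cells are modeled as an empty set, on the assumption that a blank cell means the mapping cannot be converted.

   ```scala
   object ConversionTableSketch {
     // Hypothetical key: (Parquet primitive type name, original type annotation).
     type ParquetKey = (String, String)

     // Excerpt of the table: Spark target type -> candidate labels.
     // Decimal rows keep several candidates because the final label
     // depends on precision and scale, which this sketch does not resolve.
     val table: Map[ParquetKey, Map[String, Set[Int]]] = Map(
       ("INT32", "INT_8") -> Map(
         "Float" -> Set(2), "Double" -> Set(2), "Byte" -> Set(1),
         "Short" -> Set(2), "Integer" -> Set(2), "Decimal" -> Set(2),
         "Long" -> Set(2), "String" -> Set(2)),
       ("INT32", "DATE") -> Map(
         "Date" -> Set(1), "Timestamp" -> Set(2), "String" -> Set(2)),
       ("INT64", "DECIMAL") -> Map(
         "Float" -> Set(3), "Double" -> Set(3), "Byte" -> Set(2, 3),
         "Short" -> Set(2, 3), "Integer" -> Set(2, 3),
         "Decimal" -> Set(1, 2, 3), "Long" -> Set(2, 3), "String" -> Set(2))
     )

     // Returns the candidate labels, or an empty set when the cell is blank
     // or the row is not part of this excerpt.
     def labels(primitive: String, original: String, sparkType: String): Set[Int] =
       table.get((primitive, original)).flatMap(_.get(sparkType)).getOrElse(Set.empty)
   }
   ```

   For example, `ConversionTableSketch.labels("INT32", "DATE", "Timestamp")` yields `Set(2)`, while a blank cell such as `("INT32", "DATE", "Float")` yields an empty set.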
   
   **There are five code changes:**
   
   - Added a user configuration

     Users can set the data conversion mode with the data source option `conversionMode` or the global SQL option `spark.sql.parquet.conversionMode`. Three modes are available:

     `MATCH` (default): Converts only mappings labeled 1 in the table.

     `NO_SIDE_EFFECTS`: Converts mappings labeled 1 or 2 in the table.

     `LOSS_PRECISION`: Converts all mappings labeled 1, 2, or 3 in the table.
   
   - Added instructions for conversion
   
     *In docs/sql-data-sources-parquet.md*.
   
   - Added a checker to determine whether a conversion is allowed

     The checker takes the Spark `requiredSchema` and the schema from the Parquet file footer metadata, and checks the mapping between the two schemas against the `conversionMode` to determine whether each type conversion is allowed.

     If a conversion is not allowed, an exception is thrown.

     *The main logic is in `org.apache.spark.sql.execution.datasources.parquet.SparkParquetSchemaChecker`*
   
   - Added conversion logic for `spark.sql.parquet.enableVectorizedReader=false`

     *The main logic is in `org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter`*

   - Added conversion logic for `spark.sql.parquet.enableVectorizedReader=true`

     *The main logic is in `org.apache.spark.sql.execution.datasources.parquet.ParquetColumnConverter`*
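
   The gating that the checker performs can be sketched roughly as follows. This is not the code of `SparkParquetSchemaChecker`; the object `CheckerSketch`, its method names, and the choice of `UnsupportedOperationException` are assumptions made for illustration. Only the three mode names and the labels each mode accepts come from the description above.

   ```scala
   import java.util.Locale

   object CheckerSketch {
     // Each mode accepts a fixed set of table labels.
     sealed trait ConversionMode { def allowed: Set[Int] }
     case object Match extends ConversionMode { val allowed: Set[Int] = Set(1) }
     case object NoSideEffects extends ConversionMode { val allowed: Set[Int] = Set(1, 2) }
     case object LossPrecision extends ConversionMode { val allowed: Set[Int] = Set(1, 2, 3) }

     // Parses the option value; the error type here is an assumption.
     def parseMode(name: String): ConversionMode =
       name.toUpperCase(Locale.ROOT) match {
         case "MATCH"           => Match
         case "NO_SIDE_EFFECTS" => NoSideEffects
         case "LOSS_PRECISION"  => LossPrecision
         case other =>
           throw new IllegalArgumentException(s"Unknown conversionMode: $other")
       }

     // `label` is the table label for one (Parquet type, Spark type) pair;
     // None stands for a blank cell. Throws when the mode does not accept it.
     def check(mode: ConversionMode, label: Option[Int]): Unit =
       if (!label.exists(l => mode.allowed.contains(l))) {
         throw new UnsupportedOperationException(
           s"Conversion with label $label is not allowed in mode $mode")
       }
   }
   ```

   Under these assumptions, `CheckerSketch.check(CheckerSketch.parseMode("MATCH"), Some(2))` throws, while the same label passes in `NO_SIDE_EFFECTS` and `LOSS_PRECISION`.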
   
   ### Why are the changes needed?
   
   Spark uses the `requiredSchema` to convert the data it reads from Parquet files, without considering that the `requiredSchema` and the schema stored in the file may differ.

   This can cause data correctness problems, such as the decimal correctness problem reported in [SPARK-32317](https://issues.apache.org/jira/browse/SPARK-32317).

   I think it is necessary to consider the mapping between the `requiredSchema` and the schema stored in the file.
   
   
   ### Does this PR introduce any user-facing change?
   
   1. User configuration

      Users can set the data conversion mode with the data source option `conversionMode` or the global SQL option `spark.sql.parquet.conversionMode`. Three modes are available: `MATCH`, `NO_SIDE_EFFECTS`, and `LOSS_PRECISION`.

   2. Runtime exception

      - The checker checks the schema mapping between Spark and Parquet against the `conversionMode` to determine whether each type conversion is allowed. If a conversion is not allowed, an exception is thrown.
      - When an error or data overflow occurs during data conversion, an exception is thrown instead of the value being silently set to null.
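
   The second runtime behavior (fail on overflow rather than produce a wrong or null value) can be illustrated with a small Scala sketch. The object `StrictNarrowingSketch` and its helpers are hypothetical, and the use of `ArithmeticException` is an assumption; the PR description does not say which exception type the converters throw.

   ```scala
   object StrictNarrowingSketch {
     // Narrow an INT64 value to a Spark Integer.
     // Math.toIntExact throws ArithmeticException when the value does not fit.
     def longToInt(value: Long): Int = Math.toIntExact(value)

     // Narrow an INT64 value to a Spark Byte with an explicit range check,
     // since plain `toByte` would silently wrap around.
     def longToByte(value: Long): Byte = {
       if (value < Byte.MinValue || value > Byte.MaxValue) {
         throw new ArithmeticException(s"Value $value does not fit in a byte")
       }
       value.toByte
     }
   }
   ```

   With these helpers, an in-range value converts normally, and an out-of-range value such as `Int.MaxValue.toLong + 1` raises an exception instead of being truncated or replaced with null.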
   
   ### How was this patch tested?
   
   // TODO  Added UTs
   
   

