dongjoon-hyun commented on a change in pull request #34199:
URL: https://github.com/apache/spark/pull/34199#discussion_r723939682



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -60,40 +58,106 @@ class ParquetToSparkSchemaConverter(
   /**
    * Converts Parquet [[MessageType]] `parquetSchema` to a Spark SQL [[StructType]].
    */
-  def convert(parquetSchema: MessageType): StructType = convert(parquetSchema.asGroupType())
+  def convert(parquetSchema: MessageType): StructType = {
+    val column = new ColumnIOFactory().getColumnIO(parquetSchema)
+    val converted = convertInternal(column)
+    converted.sparkType.asInstanceOf[StructType]
+  }
 
-  private def convert(parquetSchema: GroupType): StructType = {
-    val fields = parquetSchema.getFields.asScala.map { field =>
-      field.getRepetition match {
-        case OPTIONAL =>
-          StructField(field.getName, convertField(field), nullable = true)
+  /**
+   * Converts `parquetSchema` into a [[ParquetType]], which contains its corresponding Spark
+   * SQL [[StructType]] along with other information such as the maximum repetition and
+   * definition level of each node, column descriptors for the leaf nodes, etc.
+   *
+   * If `sparkReadSchema` is not empty, then when deriving the Spark SQL type for a Parquet
+   * field, this checks whether the same field also exists in that schema and, if so, uses the
+   * Spark SQL type instead. This is necessary because conversion from Parquet to Spark could
+   * otherwise lose precision: for instance, the Spark read schema may be smallint/tinyint
+   * while Parquet only supports int.
+   */
+  def convertParquetType(
+      parquetSchema: MessageType,
+      sparkReadSchema: Option[StructType] = None,
+      caseSensitive: Boolean = true): ParquetType = {
+    val column = new ColumnIOFactory().getColumnIO(parquetSchema)
+    convertInternal(column, sparkReadSchema, caseSensitive)
+  }
 
-        case REQUIRED =>
-          StructField(field.getName, convertField(field), nullable = false)
+  private def convertInternal(
+      groupColumn: GroupColumnIO,
+      sparkReadSchema: Option[StructType] = None,
+      caseSensitive: Boolean = true): ParquetType = {
+    val converted = (0 until groupColumn.getChildrenCount).map { i =>
+      val field = groupColumn.getChild(i)
+      var fieldReadType = sparkReadSchema.flatMap { schema =>
+        schema.find(f => isSameFieldName(f.name, field.getName, caseSensitive)).map(_.dataType)
+      }
+
+      // if a field is repeated here then it is neither contained by a `LIST` nor `MAP`

Review comment:
       nit. `if` -> `If`
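
       As background for the doc comment above, which says the converter records the maximum repetition and definition level of each node: here is a minimal Scala sketch (not part of this PR; the schema string and object name are invented for illustration) of how parquet-mr exposes those levels on each leaf column descriptor.

```scala
import org.apache.parquet.schema.MessageTypeParser

object RepDefLevelDemo {
  def main(args: Array[String]): Unit = {
    // A made-up schema with optional and repeated nesting, similar to how
    // Spark lays out an array<string> column in Parquet.
    val schema = MessageTypeParser.parseMessageType(
      """message spark_schema {
        |  required int32 id;
        |  optional group tags (LIST) {
        |    repeated group list {
        |      optional binary element (UTF8);
        |    }
        |  }
        |}""".stripMargin)

    // Each leaf column descriptor carries the max repetition/definition levels
    // of the kind convertParquetType stores alongside the converted Spark type.
    schema.getColumns.forEach { col =>
      println(s"${col.getPath.mkString(".")}: " +
        s"maxRep=${col.getMaxRepetitionLevel}, maxDef=${col.getMaxDefinitionLevel}")
    }
    // Prints:
    //   id: maxRep=0, maxDef=0
    //   tags.list.element: maxRep=1, maxDef=3
  }
}
```

       Each optional or repeated ancestor increments the definition level, and each repeated ancestor also increments the repetition level, which is why `tags.list.element` ends up with maxDef=3 and maxRep=1.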




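       Similarly, a hedged sketch of the `sparkReadSchema` override described in the doc comment. The helpers below are illustrative stand-ins, not the PR's code: `readTypeFor` and the object name are invented, and this `isSameFieldName` is only a guess at the helper the diff references. It shows why a narrower Spark read schema such as smallint can survive conversion even though Parquet stores the column as a 32-bit int:

```scala
import java.util.Locale
import org.apache.spark.sql.types._

object ReadSchemaLookupSketch {
  // Illustrative stand-in for the `isSameFieldName` referenced in the diff.
  def isSameFieldName(left: String, right: String, caseSensitive: Boolean): Boolean =
    if (caseSensitive) left == right
    else left.toLowerCase(Locale.ROOT) == right.toLowerCase(Locale.ROOT)

  // Hypothetical helper mirroring the sparkReadSchema lookup in convertInternal:
  // find a Spark field matching the Parquet field name, honoring case
  // sensitivity, and prefer its data type over the converted one.
  def readTypeFor(
      parquetFieldName: String,
      sparkReadSchema: Option[StructType],
      caseSensitive: Boolean): Option[DataType] = {
    sparkReadSchema.flatMap { schema =>
      schema.find(f => isSameFieldName(f.name, parquetFieldName, caseSensitive))
        .map(_.dataType)
    }
  }

  def main(args: Array[String]): Unit = {
    // Parquet has no 16-bit physical type, so a smallint column is stored as
    // INT32; with the read schema available, the converter can keep ShortType
    // instead of widening to IntegerType.
    val readSchema = StructType(Seq(StructField("age", ShortType)))
    assert(readTypeFor("age", Some(readSchema), caseSensitive = true).contains(ShortType))
    assert(readTypeFor("AGE", Some(readSchema), caseSensitive = true).isEmpty)
    assert(readTypeFor("AGE", Some(readSchema), caseSensitive = false).contains(ShortType))
  }
}
```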
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


