nsivabalan commented on issue #9282:
URL: https://github.com/apache/hudi/issues/9282#issuecomment-1664021257

   I have a hunch about where the issue could be.
   The SQL conf defaults are actually set in ParquetFileFormat:
https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

   The file referenced above is directly from the Spark repo. Code snippet of interest:
   ```scala
    conf.set(
         SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key,
         sparkSession.sessionState.conf.legacyParquetNanosAsLong.toString)
   ```
   
   But in Hudi, we have overridden the file format:

https://github.com/apache/hudi/blob/2d779fb5aa1ebfd33676ebf29217f25c60e17d12/hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark33HoodieParquetFileFormat.scala

   ```scala
       // Using string value of this conf to preserve compatibility across spark versions.
       hadoopConf.setBoolean(
         "spark.sql.legacy.parquet.nanosAsLong",
         sparkSession.sessionState.conf.getConfString("spark.sql.legacy.parquet.nanosAsLong", "false").toBoolean
       )
   ```
   
   This is slightly different from how other, similar configs are set:
   ```scala
   hadoopConf.setBoolean(
     SQLConf.PARQUET_BINARY_AS_STRING.key,
     sparkSession.sessionState.conf.isParquetBinaryAsString)
   hadoopConf.setBoolean(
     SQLConf.PARQUET_INT96_AS_TIMESTAMP.key,
     sparkSession.sessionState.conf.isParquetINT96AsTimestamp)
   ```
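   To make the difference concrete, here is a minimal, Spark-free sketch (all names hypothetical) modeling the two lookup styles: the typed accessor falls back to the conf entry's registered default, while the `getConfString(key, "false")` style falls back to a hardcoded `"false"` regardless of what the registered default is. If the two defaults ever disagree (e.g. across Spark versions), the paths diverge for an unset conf:

   ```scala
   // Hypothetical model of the two conf-lookup styles; not Spark code.
   object ConfLookupSketch {
     val key = "spark.sql.legacy.parquet.nanosAsLong"

     // Typed-accessor style (like sessionState.conf.legacyParquetNanosAsLong):
     // falls back to the conf entry's registered default when unset.
     def typedLookup(session: Map[String, String], registeredDefault: Boolean): Boolean =
       session.get(key).map(_.toBoolean).getOrElse(registeredDefault)

     // getConfString style (as in Spark33HoodieParquetFileFormat):
     // falls back to a hardcoded "false", ignoring the registered default.
     def stringLookup(session: Map[String, String]): Boolean =
       session.getOrElse(key, "false").toBoolean

     def main(args: Array[String]): Unit = {
       // When the user sets the conf explicitly, both styles agree.
       val explicit = Map(key -> "true")
       assert(typedLookup(explicit, registeredDefault = false) == stringLookup(explicit))

       // When the conf is unset and the registered default differs from
       // "false", the two styles silently diverge.
       println(typedLookup(Map.empty, registeredDefault = true)) // true
       println(stringLookup(Map.empty))                          // false
     }
   }
   ```

   So one thing to check is whether the hardcoded `"false"` fallback matches the session default that vanilla ParquetFileFormat would have propagated.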
   
   @amrishlal: can you dig deeper in this direction?
   