mansipp commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664649658
Manually tested the s3a path using an EMR cluster.
```sh
spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars /usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar
```
```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.HiveSyncConfig
import org.apache.hudi.sync.common.HoodieSyncConfig

// Create a DataFrame
val tableName = "mansi_s3a_hudi_test"
val tablePath = "s3a://<myBucket>/tables/" + tableName
val inputDF = Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
  ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
  ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
  ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")

// Specify common DataSourceWriteOptions in the single hudiOptions variable
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> tableName,
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)

// Write the DataFrame as a Hudi dataset
(inputDF.write
  .format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(tablePath))
```
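
To confirm the write actually landed on the s3a path, a minimal read-back in the same spark-shell session (reusing the `tablePath` defined above) looks like this; it is just a quick sanity check, not part of the original test run:

```scala
// Read the Hudi table back from the s3a base path and inspect a few columns.
val readDF = spark.read
  .format("org.apache.hudi")
  .load(tablePath)

readDF.select("id", "creation_date", "last_update_time").show(false)
```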
