codope commented on issue #2439:
URL: https://github.com/apache/hudi/issues/2439#issuecomment-930059601


   @rakeshramakrishnan For hive sync to work inline through Hudi, the 
hive-site.xml from <hive_install_dir>/conf should also be placed under 
<spark_install_dir>/conf, and it should contain the correct metastore URI. Can 
you check the Hive metastore URI in the hive-site.xml inside 
<spark_install_dir>/conf? 
   
   I tried to reproduce with a remote MySQL database as the metastore. My 
JDBC-specific configs in hive-site.xml look as follows:
   ```
   "javax.jdo.option.ConnectionURL": 
"jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
   "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
   "javax.jdo.option.ConnectionUserName": "username",
   "javax.jdo.option.ConnectionPassword": "password"  
   ```
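   
   (The snippet above is in JSON key/value form, as you would supply it in an 
EMR configuration classification, for example. In hive-site.xml itself, each of 
those keys would appear as an XML property; a sketch of the first one, with the 
same placeholder values:)
   ```
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
   </property>
   ```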
   
   Then the following pyspark script works:
   ```
   pyspark \
   >  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
   >  --conf "spark.sql.hive.convertMetastoreParquet=false" \
   >  --jars 
/home/hadoop/hudi-spark3-bundle_2.12-0.10.0-SNAPSHOT.jar,/usr/lib/spark/external/lib/spark-avro.jar
   ...
   ...
   Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
   SparkSession available as 'spark'.
   >>> from pyspark.sql import functions as F
   >>>
   >>> inputDF = spark.createDataFrame([
   ...  ("100", "2015/01/01", "2015-01-01T13:51:39.340396Z"),
   ...  ("101", "2015/01/01", "2015-01-01T12:14:58.597216Z"),
   ...  ("102", "2015/01/01", "2015-01-01T13:51:40.417052Z"),
   ...  ("103", "2015/01/01", "2015-01-01T13:51:40.519832Z"),
   ...  ("104", "2015/01/02", "2015-01-01T12:15:00.512679Z"),
   ...  ("105", "2015/01/02", "2015-01-01T13:51:42.248818Z")],
   ...  ["id", "creation_date", "last_update_time"])
   >>>
   >>> hudiOptions = {
   ...   "hoodie.table.name" : "hudi_hive_table",
   ...   "hoodie.datasource.write.table.type" : "COPY_ON_WRITE",
   ...   "hoodie.datasource.write.operation" : "insert",
   ...   "hoodie.datasource.write.recordkey.field" : "id",
   ...   "hoodie.datasource.write.partitionpath.field" : "creation_date",
   ...   "hoodie.datasource.write.precombine.field" : "last_update_time",
   ...   "hoodie.datasource.hive_sync.enable" : "true",
   ...   "hoodie.datasource.hive_sync.table" : "hudi_hive_table",
   ...   "hoodie.datasource.hive_sync.partition_fields" : "creation_date"
   ... }
   >>>
   >>> 
inputDF.write.format("org.apache.hudi").options(**hudiOptions).mode("overwrite").save("s3://huditestbkt/hive_sync/")
   21/09/29 10:22:08 WARN HoodieSparkSqlWriter$: hoodie table at 
s3://huditestbkt/hive_sync already exists. Deleting existing data & overwriting 
with new data.
   21/09/29 10:22:34 WARN HiveConf: HiveConf of name hive.server2.thrift.url 
does not exist
   >>>
   ```
   

