mithalee commented on issue #3336: URL: https://github.com/apache/hudi/issues/3336#issuecomment-892112742
@codope Hi, I did try HoodieDeltaStreamer on the following EMR release:

- Release label: emr-6.3.0
- Hadoop distribution: Amazon 3.2.1
- Applications: Tez 0.9.2, Spark 3.1.1, Hive 3.1.2, JupyterHub 1.2.0, Sqoop 1.4.7, Zeppelin 0.9.0, Hue 4.9.0, Presto 0.245.1
- Hudi utility: https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle_2.12/0.8.0

HoodieDeltaStreamer works as expected there (upserts/deletes). However, the same data set and the same spark-submit configuration do not work with the Spark on Kubernetes binaries (https://spark.apache.org/docs/3.1.1/submitting-applications.html). I can perform the initial insert into the Hudi table through the spark-submit below, but the upserts/deletes throw errors. Can you confirm whether upserts/deletes work using the DeltaStreamer on Spark on K8s?

**Spark submit**

```shell
spark-submit \
  --master k8s://https://...sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=d.../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A........' \
  --conf spark.hadoop.fs.s3a.secret.key='L.........' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://l...v/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field one \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://......./hudi-root/transformed-tables/hudi_writer_mm7/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=one \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://....../jen/example_4.parquet
```

**Parquet file `example_4.parquet`:**

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'one': [-1, 3, 2.5],
        'two': [100, 200, 300],
        'three': [True, True, True],
        '_hoodie_is_deleted': [False, False, False],
    },
    index=list('abc'),
)
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_4.parquet')
```

**Delete file `example_delete4.parquet`:**

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Same records, but the first row is flagged for deletion.
df = pd.DataFrame(
    {
        'one': [-1, 3, 2.5],
        'two': [100, 200, 300],
        'three': [True, True, True],
        '_hoodie_is_deleted': [True, False, False],
    },
    index=list('abc'),
)
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_delete4.parquet')
```
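For context on what the second batch is expected to do: Hudi treats the `_hoodie_is_deleted` boolean column as a soft-delete marker during upsert, keyed here on the record key field `one`. Below is a minimal pandas-only sketch of that semantic (the `apply_batch` helper is hypothetical illustration, not Hudi code):

```python
import pandas as pd

# Insert batch: same schema as example_4.parquet; `one` is the record key.
inserts = pd.DataFrame({
    "one": [-1, 3, 2.5],
    "two": [100, 200, 300],
    "three": [True, True, True],
    "_hoodie_is_deleted": [False, False, False],
})

# Delete batch: identical keys, but the first record is flagged for deletion
# (mirrors example_delete4.parquet).
deletes = pd.DataFrame({
    "one": [-1, 3, 2.5],
    "two": [100, 200, 300],
    "three": [True, True, True],
    "_hoodie_is_deleted": [True, False, False],
})

def apply_batch(table: pd.DataFrame, batch: pd.DataFrame, key: str = "one") -> pd.DataFrame:
    """Upsert: incoming records replace existing rows with the same key,
    then any row flagged _hoodie_is_deleted is dropped."""
    merged = pd.concat([table[~table[key].isin(batch[key])], batch])
    return merged[~merged["_hoodie_is_deleted"]].reset_index(drop=True)

state = apply_batch(inserts.iloc[0:0], inserts)  # initial insert: 3 rows
state = apply_batch(state, deletes)              # the record with key -1 is gone
print(state["one"].tolist())
```

If upserts work end to end on the K8s setup, the table should end up with only the two unflagged records, exactly as observed on EMR.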
