esaeki opened a new issue #4174:
URL: https://github.com/apache/hudi/issues/4174


   I would like to delete some data in S3 using Glue and Hudi, but I encountered the following error.
   
   ```
   An error occurred while calling 0128.save. Illegal character in: tableName_record
   ```
   
   The format of the data in S3 should be fine. Does anyone know what is causing this error?
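   
   In case it helps with diagnosis, below is a minimal sketch (not part of my actual job) of how the failing `save()` call could be wrapped to print the full Java-side stack trace. It assumes the `df_delete` and `hudi_options` defined in the script further down; `Py4JJavaError` is the exception type py4j raises for errors thrown on the Java side.
   
   ```
   from py4j.protocol import Py4JJavaError  # wraps exceptions thrown on the Java side
   
   try:
       # the same delete write that currently fails
       df_delete.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
   except Py4JJavaError as e:
       # printing the exception includes the Java stack trace, which should show
       # where the "Illegal character" error is actually raised
       print(str(e))
       raise
   ```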
   
   Here is the script for my Glue job.
   
   ```
   import sys
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from pyspark.sql.session import SparkSession
   from awsglue.context import GlueContext
   from awsglue.job import Job
   
   ## @params: [JOB_NAME]
   args = getResolvedOptions(sys.argv, ['JOB_NAME'])
   
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .getOrCreate()
   sc = spark.sparkContext
   glueContext = GlueContext(sc)
   job = Job(glueContext)
   job.init(args['JOB_NAME'], args)
   
   tableName = 'hudi_sample_mor'
   bucketName = 'cm-hudi-sample--datalake'
   basePath = f's3://{bucketName}/{tableName}'
   
   schema = ["time", "transaction_id", "option"]
   data = [
       ("2020/12/18", "00001", "A"),
       ("2020/12/18", "00002", "A"),
       ("2020/12/19", "00003", "A"),
       ("2020/12/19", "00004", "A"),
       ("2020/12/20", "00005", "A"),
       ("2020/12/20", "00006", "A"),
   ]
   df = spark.createDataFrame(data, schema)
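   # First write: insert the sample records, creating the MERGE_ON_READ table with the options below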
   
   hudi_options = {
     'hoodie.table.name': tableName, 
     'hoodie.datasource.write.storage.type': 'MERGE_ON_READ', 
     'hoodie.compact.inline': True, 
     'hoodie.compact.inline.max.delta.commits': 20,
     'hoodie.parquet.small.file.limit': 0,
     'hoodie.datasource.write.recordkey.field': 'transaction_id', 
     'hoodie.datasource.write.partitionpath.field': 'time', 
     'hoodie.datasource.write.table.name': tableName, 
     'hoodie.datasource.write.operation': 'insert', 
     'hoodie.datasource.write.precombine.field': 'option', 
     'hoodie.upsert.shuffle.parallelism': 2,  
     'hoodie.insert.shuffle.parallelism': 2, 
   }
   
   df.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   
   job.commit()
   
   schema = ["time", "transaction_id", "option"]
   data_delete = [
       ("2020/12/20", "00005", "A"),
   ]
   df_delete = spark.createDataFrame(data_delete, schema)
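   # Second write: same options, but the operation is switched to 'delete' so the record with this key is removed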
   
   hudi_options = {
     'hoodie.table.name': tableName,
     'hoodie.datasource.write.storage.type': 'MERGE_ON_READ', 
     'hoodie.compact.inline': True, 
     'hoodie.compact.inline.max.delta.commits': 20,
     'hoodie.parquet.small.file.limit': 0,
     'hoodie.datasource.write.recordkey.field': 'transaction_id', 
     'hoodie.datasource.write.partitionpath.field': 'time', 
     'hoodie.datasource.write.table.name': tableName, 
     'hoodie.datasource.write.operation': 'delete', 
     'hoodie.datasource.write.precombine.field': 'option', 
     'hoodie.upsert.shuffle.parallelism': 2,  
     'hoodie.insert.shuffle.parallelism': 2, 
   }
   
   df_delete.write.format("hudi"). \
     options(**hudi_options). \
     mode("append"). \
     save(basePath)
   
   job.commit()
   ```
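   
   For reference, once the delete works I would verify it with a snapshot read of the table. This is only a sketch: on recent Hudi releases `spark.read.format("hudi").load(basePath)` resolves the partitions, while some older versions need a glob pattern appended to the base path for partitioned tables.
   
   ```
   # Snapshot read of the Hudi table to check which records remain after the delete
   df_check = spark.read.format("hudi").load(basePath)
   df_check.select("transaction_id", "time", "option").orderBy("transaction_id").show()
   ```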

