[GitHub] [hudi] jiegzhan commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query

GitBox Wed, 19 Aug 2020 09:58:12 -0700


jiegzhan commented on issue #1980:
URL: https://github.com/apache/hudi/issues/1980#issuecomment-676544068



   @bvaradar What is the size of the new versions of the same files after you run the delete query? For me, they are 423KB.
   
   Step 1: ran bulk_insert query:
   ```
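   // Imports these snippets appear to rely on (assumed; not shown in the original comment)
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   import org.apache.spark.sql.SaveMode.Append
   import org.apache.spark.sql.functions.col

   df.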
     write.format("org.apache.hudi").
     option("hoodie.datasource.write.operation", "bulk_insert").
     option("hoodie.bulkinsert.shuffle.parallelism", 5120).
     option("hoodie.parquet.max.file.size", 2000000000).
     option("hoodie.parquet.block.size", 2000000000).
     option("hoodie.parquet.small.file.limit", 512000000).
     option("hoodie.combine.before.insert", "false").
     option("hoodie.combine.before.upsert", "false").
     option(TABLE_NAME, tableName).
     option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE").
     option(RECORDKEY_FIELD_OPT_KEY, "device_id").
     option(PRECOMBINE_FIELD_OPT_KEY, "device_id").
     option(PARTITIONPATH_FIELD_OPT_KEY, "date_key").
     option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
     option(HIVE_DATABASE_OPT_KEY, "default").
     option(HIVE_TABLE_OPT_KEY, "hudi_fact_device_logs").
     option(HIVE_USER_OPT_KEY, "hadoop").
     option(HIVE_PARTITION_FIELDS_OPT_KEY, "date_key").
     option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName).
     mode(Append).
     save(basePath)
   ```
   **Got 5120 parquet files in S3, each about 300MB - 700MB.**
   
   Step 2: ran delete query:
   ```
   val deleteExistingRecords = spark.read.format("org.apache.hudi").
     load(basePath + "/*/*").
     where(col("device_id").startsWith("D"))
   
   deleteExistingRecords.
     write.format("org.apache.hudi").
     option("hoodie.datasource.write.operation", "delete").
     option("hoodie.parquet.max.file.size", 2000000000).
     option("hoodie.parquet.block.size", 2000000000).
     option("hoodie.parquet.small.file.limit", 512000000).
     option(TABLE_NAME, tableName).
     option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE").
     option(RECORDKEY_FIELD_OPT_KEY, "device_id").
     option(PRECOMBINE_FIELD_OPT_KEY, "device_id").
     option(PARTITIONPATH_FIELD_OPT_KEY, "date_key").
     option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
     option(HIVE_DATABASE_OPT_KEY, "default").
     option(HIVE_TABLE_OPT_KEY, "hudi_fact_device_logs").
     option(HIVE_USER_OPT_KEY, "hadoop").
     option(HIVE_PARTITION_FIELDS_OPT_KEY, "date_key").
     option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName).
     mode(Append).
     save(basePath)
   ```
   **Got some newly generated small files (423KB) (see the screenshot at the top).**
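
   To compare the old and new version of each file, a rough sketch like the one below could group a directory listing by file group. It is not part of Hudi: `BaseFile`, `FileVersionSizes`, `parse` and `versionsByFileGroup` are hypothetical names, and it assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet`, which may not hold for every Hudi version. The input is a plain list of (file name, size in bytes) pairs, however you obtain it from S3.

   ```scala
   // Hypothetical helper, not a Hudi API: compares base-file versions per file group.
   // Assumption: base files are named <fileId>_<writeToken>_<instantTime>.parquet.
   final case class BaseFile(fileId: String, instantTime: String, sizeBytes: Long)

   object FileVersionSizes {
     // Parse one listing entry; returns None for names that do not match the assumed pattern
     // (for example partition metadata files).
     def parse(name: String, sizeBytes: Long): Option[BaseFile] =
       name.stripSuffix(".parquet").split("_") match {
         case Array(fileId, _, instantTime) if name.endsWith(".parquet") =>
           Some(BaseFile(fileId, instantTime, sizeBytes))
         case _ => None
       }

     // For each file group, return (instantTime, sizeBytes) pairs ordered oldest to newest.
     // Instant times are timestamp strings, so lexicographic order is chronological order.
     def versionsByFileGroup(listing: Seq[(String, Long)]): Map[String, Seq[(String, Long)]] =
       listing
         .flatMap { case (name, size) => parse(name, size) }
         .groupBy(_.fileId)
         .map { case (fileId, files) =>
           fileId -> files.sortBy(_.instantTime).map(f => (f.instantTime, f.sizeBytes))
         }
   }
   ```

   Calling `FileVersionSizes.versionsByFileGroup(listing)` on the listing of one partition should show, for each file ID, whether the 423KB file is a newer version of one of the 300MB - 700MB files or belongs to a different file group.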
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
