[
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tao meng updated HUDI-2214:
---------------------------
Description:
Residual temporary files left behind by clustering are not cleaned up.
// Test steps
Step 1: run inline clustering
val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2))
inputDF1.write.format("org.apache.hudi")
  .options(commonOpts)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  // options for inline clustering
  .option("hoodie.parquet.small.file.limit", "0")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "1")
  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
  // note: this key was set twice in the original repro (1073741824, then 12 MB); the later value wins
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", String.valueOf(12 * 1024 * 1024L))
  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
  .mode(SaveMode.Overwrite)
  .save(basePath)
Step 2: check the temp dir. /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty; the instant directory
{color:#FF0000}/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208{color}
is left behind and never cleaned up.
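For a quick local check of this symptom, something like the sketch below can list whatever is left under `.hoodie/.temp` after the write finishes. This is an illustrative helper, not code from Hudi: the class and method names (`TempDirCheck`, `residualTempDirs`) are made up for this example; only the `<basePath>/.hoodie/.temp/<instant>` layout comes from the report above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class TempDirCheck {
  // Hypothetical helper: returns the names of leftover instant directories
  // under <basePath>/.hoodie/.temp (empty list means clustering cleaned up).
  static List<String> residualTempDirs(String basePath) throws IOException {
    Path tempDir = Paths.get(basePath, ".hoodie", ".temp");
    List<String> leftovers = new ArrayList<>();
    if (!Files.exists(tempDir)) {
      return leftovers; // no marker dir at all: nothing leaked
    }
    try (Stream<Path> entries = Files.list(tempDir)) { // close the stream to release the dir handle
      entries.forEach(p -> leftovers.add(p.getFileName().toString()));
    }
    return leftovers;
  }
}
```

With the bug described here, the returned list would contain the clustering instant (e.g. `20210723171208`) instead of being empty.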
was:
residual temporary files after clustering are not cleaned up
> residual temporary files after clustering are not cleaned up
> ------------------------------------------------------------
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
> Issue Type: Bug
> Components: Cleaner
> Affects Versions: 0.8.0
> Environment: spark3.1.1
> hadoop3.1.1
> Reporter: tao meng
> Assignee: tao meng
> Priority: Major
> Fix For: 0.10.0
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)