Jonathan Vexler created HUDI-5871:
-------------------------------------
Summary: Bootstrap does not work with partitions with /
Key: HUDI-5871
URL: https://issues.apache.org/jira/browse/HUDI-5871
Project: Apache Hudi
Issue Type: Bug
Components: bootstrap, spark
Reporter: Jonathan Vexler
Attachments: scala_output_bootstrap1.txt
I have parquet data that I load into a DataFrame and then write out as a
partitioned table by doing
{code:java}
df.write.partitionBy("partition").parquet(tablePath) {code}
In the table, each partition is a directory labeled like partition=2022%2F1%2F25
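The %2F in that directory name is the percent-encoded form of the / in the
partition value. A minimal sketch of the same encoding, using
java.net.URLEncoder purely as a stand-in (an assumption; Spark applies its own
path-escaping routine when it writes partition directories):
{code:scala}
import java.net.URLEncoder

// Partition value as it appears in the "partition" column.
val partitionValue = "2022/1/25"

// Percent-encode the value; "/" is not a safe character, so it becomes %2F.
val encodedValue = URLEncoder.encode(partitionValue, "UTF-8")

// Directory name in the column=value layout used by partitionBy.
val dirName = s"partition=$encodedValue"
println(dirName)
{code}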
I then do a bootstrap by doing
{code:scala}
import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
import org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers}
import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieWriteConfig}
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types._
val srcPath =
"/Users/jon/Documents/bootstrap_testing/partitioned-parquet-table-fixed"
val basePath = "/Users/jon/Documents/bootstrap_testing/tables/test8"
val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
.format("hudi")
.option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY,
DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
.option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
.option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS,
classOf[SimpleKeyGenerator].getName)
.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR,
classOf[BootstrapRegexModeSelector].getName)
.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "2022/1/2[4-8]")
.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE,
"METADATA_ONLY")
.option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER,
classOf[SparkParquetBootstrapDataProvider].getName)
.mode(SaveMode.Overwrite)
.save(basePath)
{code}
That does not create any METADATA_ONLY partitions, because the regex is matched
against the directory name, not the partition path. This should be clarified in
the config documentation. I then change the regex to
{code:java}
partition=2022%2F1%2F2[4-8] {code}
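A quick way to see the difference between the two patterns, assuming the
selector applies the regex to the whole directory name as it appears on disk
(an assumption; this sketch uses plain java.util.regex and does not call into
Hudi):
{code:scala}
import java.util.regex.Pattern

// Directory name written by partitionBy("partition") for the value 2022/1/25.
val dirName = "partition=2022%2F1%2F25"

// Original pattern, written against the logical partition value.
val original = Pattern.compile("2022/1/2[4-8]")
// Revised pattern, written against the encoded directory name.
val encoded = Pattern.compile("partition=2022%2F1%2F2[4-8]")

// The directory name contains no literal "/", so the original pattern cannot match it.
val originalMatches = original.matcher(dirName).matches()
val encodedMatches = encoded.matcher(dirName).matches()
println(s"original: $originalMatches, encoded: $encodedMatches")
{code}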
The new regex works properly, but there is an issue.
Inside the hudi table, the directories are
{code:java}
2022 partition=2022%2F1%2F24 partition=2022%2F1%2F25
partition=2022%2F1%2F26 partition=2022%2F1%2F27 partition=2022%2F1%2F28 {code}
The 2022 directory contains the FULL_BOOTSTRAP partitions, but the METADATA_ONLY
partitions are in the other directories.
Maybe that is OK, so I try to read from the hudi table. This file contains the
output from my attempt: [^scala_output_bootstrap1.txt]
I go back to my parquet table and make a copy and move the partitions into the
hudi structure where
2022->1->24
2022->1->25
...
2022->1->31
2022->2->1
....
is the directory structure. I change the regex back to how it was originally
and run the bootstrap again. This time, the hudi directory contains 2022, which
has the partitions that are METADATA_ONLY, but there is another directory,
__HIVE_DEFAULT_PARTITION__, that contains the FULL_BOOTSTRAP files.
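The relayout above amounts to decoding each encoded directory name into a nested
relative path. A minimal sketch of that name mapping; nestedPathFor is a
hypothetical helper, not part of Hudi or Spark, and it only computes the target
path without moving any files:
{code:scala}
import java.net.URLDecoder

// Hypothetical helper: maps "partition=2022%2F1%2F25" to the relative path "2022/1/25".
def nestedPathFor(dirName: String, column: String = "partition"): String = {
  val prefix = column + "="
  require(dirName.startsWith(prefix), s"unexpected directory name: $dirName")
  // Drop the "column=" prefix, then decode %2F back into "/".
  URLDecoder.decode(dirName.stripPrefix(prefix), "UTF-8")
}

val nested = nestedPathFor("partition=2022%2F1%2F25")
println(nested)
{code}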
When I attempt to read from the hudi table I get
{code:java}
scala> spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")
scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/29").count
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res16: Long = 0
scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/24").count
23/03/02 15:11:51 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res17: Long = 0 {code}