Jonathan Vexler created HUDI-5871:
-------------------------------------

             Summary: Bootstrap does not work with partitions with /
                 Key: HUDI-5871
                 URL: https://issues.apache.org/jira/browse/HUDI-5871
             Project: Apache Hudi
          Issue Type: Bug
          Components: bootstrap, spark
            Reporter: Jonathan Vexler
         Attachments: scala_output_bootstrap1.txt

I have Parquet data that I load into a DataFrame and then write out as a 
partitioned Parquet table:

 
{code:java}
df.write.partitionBy("partition").parquet(tablePath) {code}
In the resulting table, each partition is a directory named like 
partition=2022%2F1%2F25 (the "/" in the partition value is escaped as %2F).

 

I then run a bootstrap:

 
{code:scala}
import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
import org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers}
import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieWriteConfig}
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types._

val srcPath = "/Users/jon/Documents/bootstrap_testing/partitioned-parquet-table-fixed"
val basePath = "/Users/jon/Documents/bootstrap_testing/tables/test8"

val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
    .format("hudi")
    .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
    .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
    .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, classOf[SimpleKeyGenerator].getName)
    .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR, classOf[BootstrapRegexModeSelector].getName)
    .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "2022/1/2[4-8]")
    .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE, "METADATA_ONLY")
    .option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER, classOf[SparkParquetBootstrapDataProvider].getName)
    .mode(SaveMode.Overwrite)
    .save(basePath)
{code}
This does not create any METADATA_ONLY partitions, because the regex is 
matched against the directory name rather than the partition path. This should 
be clarified in the config documentation. I then change the regex to
{code:java}
partition=2022%2F1%2F2[4-8] {code}
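
The mismatch between the two regexes can be reproduced outside Hudi. The 
sketch below is only an illustration with plain java.util.regex, assuming the 
selector compares the regex with the directory name as it appears on disk 
(which is what the behaviour above suggests). It does not call 
BootstrapRegexModeSelector, and the names dirName, logicalRegex and 
escapedRegex are made up for the example.
{code:scala}
import java.net.URLDecoder
import java.util.regex.Pattern

// Directory name as written by df.write.partitionBy("partition")
val dirName = "partition=2022%2F1%2F25"

// Regex written against the logical partition value
val logicalRegex = Pattern.compile("2022/1/2[4-8]")
// Regex written against the escaped directory name
val escapedRegex = Pattern.compile("partition=2022%2F1%2F2[4-8]")

// The directory name contains no literal "/", so the logical regex finds nothing
val logicalHit = logicalRegex.matcher(dirName).find()

// The escaped regex matches the whole directory name
val escapedHit = escapedRegex.matcher(dirName).matches()

// URL-decoding the value part recovers the logical partition value
val decoded = URLDecoder.decode(dirName.stripPrefix("partition="), "UTF-8")
{code}
Here logicalHit is false and escapedHit is true, while decoded is the string 
2022/1/25 that the first regex was written for.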
With this regex the bootstrap selects the expected partitions, but there is an 
issue.

Inside the Hudi table, the directories are 
{code:java}
2022                    partition=2022%2F1%2F24 partition=2022%2F1%2F25 
partition=2022%2F1%2F26 partition=2022%2F1%2F27 partition=2022%2F1%2F28 {code}
The 2022 directory contains the FULL_BOOTSTRAP partitions, but the 
METADATA_ONLY partitions are in the other partition=... directories. 

Maybe that is OK, so I try to read from the Hudi table. The attached file 
contains the output from my attempt: [^scala_output_bootstrap1.txt] 

I go back to my Parquet table, make a copy, and move the partitions into a 
Hudi-style nested structure where 

2022->1->24

2022->1->25

...

2022->1->31

2022->2->1

....

is the directory structure. I change the regex back to the original one and 
run the bootstrap again. This time, the Hudi table directory contains 2022, 
which holds the METADATA_ONLY partitions, but there is another directory, 
__HIVE_DEFAULT_PARTITION__, that contains the FULL_BOOTSTRAP files. 

When I attempt to read from the Hudi table, I get 
{code:java}
scala> spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")

scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/29").count
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res16: Long = 0

scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/24").count
23/03/02 15:11:51 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res17: Long = 0 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
