joao-miranda opened a new issue, #6601:
URL: https://github.com/apache/hudi/issues/6601

   **Describe the problem you faced**
   We are using Hudi in a Scala Glue job. We then need to crawl the generated 
data and get a table in the Glue Data Catalog. We need this for both 
partitioned and non-partitioned data.
   
   For partitioned data the output is as follows:
   ```
   .../database_name/table_name/partition=partition_key/<data files>
   ```
   
   After crawling we get a table in the data catalog with the correct name 
(table_name). That's the desired behavior.
   
   For non-partitioned data the output is as follows:
   ```
   .../database_name/table_name/<data files>
   ```
   
   The crawler then generates a table for each file. This is not what we want.
   
   We know the layout the crawler needs to work correctly: a `default` 
folder must exist above the data files:
   ```
   .../database_name/table_name/default/<data files>
   ```
   
   with the following structure for Hudi support files:
   ```
   .../database_name/table_name/.hoodie
   .../database_name/table_name/default/.hoodie_partition_metadata
   ```
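   To make the target layout concrete, here is a small sketch (local 
filesystem paths standing in for the real S3 prefix; the parquet file name is 
a placeholder) that recreates the structure we expect the crawler to see:
   ```scala
   import java.nio.file.{Files, Path}

   // Sketch: recreate the expected non-partitioned layout locally.
   // "table_name", "default" and the parquet file name are placeholders
   // standing in for the real S3 prefix and data files.
   object LayoutSketch {
     def build(): Path = {
       val base = Files.createTempDirectory("table_name")
       Files.createDirectories(base.resolve(".hoodie"))
       val default = Files.createDirectories(base.resolve("default"))
       Files.createFile(default.resolve(".hoodie_partition_metadata"))
       Files.createFile(default.resolve("part-0000.parquet"))
       base
     }

     def main(args: Array[String]): Unit = {
       val base = build()
       // List the files the crawler would see, relative to the table root.
       Files.walk(base)
         .filter(p => Files.isRegularFile(p))
         .map[String](p => base.relativize(p).toString)
         .sorted()
         .forEach(s => println(s))
     }
   }
   ```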
   
   This was seemingly the behavior up to Hudi 0.9.0, but it no longer 
happens from 0.10.0 onwards.
   
   Is there any configuration we could possibly be missing?
   
   
   **Steps to reproduce the behavior**
   **Dependencies:**
   ```
   "org.apache.hudi" %% "hudi-spark-bundle" % "2.12-0.10.0"
   "org.apache.hudi" %% "hudi-utilities-bundle" % "2.12-0.10.0"
   ```
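   As an aside: with sbt, the `%%` operator appends the Scala binary version 
suffix to the artifact name itself, so the Scala version should not be 
embedded in the version string. A toy sketch of that expansion (the `cross` 
helper is hypothetical, not part of sbt's API):
   ```scala
   // Sketch of what sbt's %% cross-versioning expands to; `cross` is a
   // hypothetical helper, not sbt API.
   object CrossVersionSketch {
     val scalaBinaryVersion = "2.12"

     def cross(org: String, artifact: String, version: String): String =
       s"$org:${artifact}_$scalaBinaryVersion:$version"

     def main(args: Array[String]): Unit =
       // prints org.apache.hudi:hudi-spark-bundle_2.12:0.10.0
       println(cross("org.apache.hudi", "hudi-spark-bundle", "0.10.0"))
   }
   ```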
   
   **Configuration used:**
   ```
    var hudiOptions = scala.collection.mutable.Map[String, String](
          HoodieWriteConfig.TABLE_NAME.key() -> "hudiTableName",
          HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() -> "true",
          DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
          DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
          DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKeyField",
          DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "ts",
          DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[AWSDmsAvroPayload].getName,
          DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[CustomKeyGenerator].getName,
          DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> ""
        )
   ```
   
   **Following options are added if a partition key is defined:**
   ```
          hudiOptions.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKeyField")
          hudiOptions.put(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
          hudiOptions.put(HoodieIndexConfig.INDEX_TYPE.key(), "GLOBAL_BLOOM")
          hudiOptions.put(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
          hudiOptions.put(DataSourceWriteOptions.DROP_PARTITION_COLUMNS.key(), "true")
   ```
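   One thing we are unsure about (flagging it as an assumption on our part): 
`CustomKeyGenerator` documents its partition field in `field:type` form, with 
type `SIMPLE` or `TIMESTAMP`, so the partitioned variant may need the type 
suffix. A standalone sketch using the raw config-key string so it compiles 
without Hudi on the classpath:
   ```scala
   // Hypothetical variant; uses the raw key string equivalent to
   // DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY so the snippet
   // stands alone. "partitionKeyField" is a placeholder column name.
   object PartitionPathSketch {
     val partitionPathKey = "hoodie.datasource.write.partitionpath.field"

     // CustomKeyGenerator expects "field:type", with type SIMPLE or TIMESTAMP.
     val hudiOptions = scala.collection.mutable.Map[String, String](
       partitionPathKey -> "partitionKeyField:SIMPLE"
     )

     def main(args: Array[String]): Unit =
       println(hudiOptions(partitionPathKey))
   }
   ```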
   
   **Saved into a file:**
   ```
       // Write the DataFrame as a Hudi dataset
       mappedDF
         .dropDuplicates()
         .write
         .format("org.apache.hudi")
         .options(hudiOptions)
         .mode(SaveMode.Append)
         .save("targetDirectory")
   ```
   
   **Expected behavior**
   Hudi's output is compatible with the AWS Glue Crawler, with or without 
partitions.
   
   **Environment Description**
   
   - Hudi version : 0.10.0
   - Spark version : 3.1.1
   - Scala version: 2.12.15
   - AWS Glue version : 3.0.0

