gunjdesai opened a new issue, #5542:
URL: https://github.com/apache/hudi/issues/5542

   **Describe the problem you faced**
   
   I am using HMS to sync my data via Spark and querying that data directly 
through Trino, but when I try to run the command
   ```
   SELECT * FROM table_name LIMIT 10
   ```
   I get the following error
   ```
   Query 20220509_155204_00012_irn5c failed: Unable to create input format 
org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run the Spark job that writes to S3, with HMS sync enabled
   2. Add the `hudi-hadoop-mr-0.11.0.jar` bundle to 
`<trino_install>/plugin/hive-hadoop2`
   3. Run the query in Trino
   
   **Expected behavior**
   
   Ideally, the query should return 10 rows of data
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.1.1
   
   * Hive version : N/A
   
   * Hadoop version : N/A 
   
   * Storage (HDFS/S3/GCS..) : S3 via Minio
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   These are the config options passed to the Spark Structured Streaming Job
   ```
   df.writeStream
     .format(Format.HUDI)
     .option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE.key(), true)
     .option(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key(), "updated_at")
     .option(DataSourceWriteOptions.TABLE_TYPE.key(), "COPY_ON_WRITE")
     .option(DataSourceWriteOptions.OPERATION.key(), upsert)
     .option(DataSourceWriteOptions.STREAMING_RETRY_INTERVAL_MS.key(), 2000)
     .option(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "view_id")
     .option(HoodieWriteConfig.TBL_NAME.key(), "question")
     .option(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "created_date")
     .option(Options.CHECKPOINT_LOCATION_KEY, "s3a://warehouse/checkpoints/question")
     .option(Options.ARCHIVE_MIN_COMMITS_KEY, 3)
     .option(Options.HOODIE_METADATA_KEEP_MIN_COMMITS_KEY, 2)
     .option(Options.HOODIE_METADATA_KEEP_MAX_COMMITS_KEY, 4)
     .option(Options.HOODIE_EMBED_TIMELINE_SERVER_KEY, "false")
     .option(HoodieIndexConfig.INDEX_TYPE.key(), "SIMPLE")
     .option(HiveSyncConfig.HIVE_SYNC_MODE.key(), "hms")
     .option(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key(), "true")
     .option(HiveSyncConfig.METASTORE_URIS.key(), "thrift://hive-metastore.trino.svc.cluster.local:9083")
     .option(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key(), "warehouse")
     .option(HoodieSyncConfig.META_SYNC_TABLE_NAME.key(), "question")
     .option(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "created_date")
     .option(HiveSyncConfig.HIVE_SYNC_ENABLED.key(), "true")
     .outputMode("append")
     .queryName("questions")
     .start("s3a://warehouse/transaction-db/questions")
   ```
   
   After querying the metastore, this is the output I get after joining `TBLS` & `DBS`:
   
   ```
   6797 | org.apache.hudi.hadoop.HoodieParquetInputFormat | s3a://warehouse/transaction-db/questions | 6797
   ```
   
   **Stacktrace**
   
   ```
   Query 20220509_155204_00012_irn5c failed: Unable to create input format org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   
   I have followed the https://hudi.apache.org/docs/syncing_metastore/ doc to 
set up HMS sync. Our setup does not include Hive or Hadoop.
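   
   For completeness, Trino reads the table through its Hive connector, which needs a catalog file pointing at the same metastore. A minimal sketch of such a catalog (the filename, MinIO endpoint, and S3 settings below are assumptions, not taken from this setup):
   
   ```
   # etc/catalog/warehouse.properties (hypothetical filename)
   connector.name=hive-hadoop2
   hive.metastore.uri=thrift://hive-metastore.trino.svc.cluster.local:9083
   # MinIO-specific settings; the endpoint value is illustrative
   hive.s3.endpoint=http://minio:9000
   hive.s3.path-style-access=true
   ```
   
   With this in place, the `hudi-hadoop-mr` bundle dropped into `plugin/hive-hadoop2` must be readable by the Trino server process, and Trino must be restarted afterwards, since plugin jars are only scanned at startup.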
   

