WaterKnight1998 commented on issue #1777:
URL: https://github.com/apache/hudi/issues/1777#issuecomment-652597165


   > Ah okay, I think these are the default values for the configs. You would need to
   > configure each of them based on your table schema. Here is the config section that
   > explains these configs:
   > https://hudi.apache.org/docs/configurations.html#PRECOMBINE_FIELD_OPT_KEY
   > https://hudi.apache.org/docs/configurations.html#RECORDKEY_FIELD_OPT_KEY
   > https://hudi.apache.org/docs/configurations.html#PARTITIONPATH_FIELD_OPT_KEY
   > 
   > I can help with these configs. You could choose a combination of
   > `date,store,item` for the record key to ensure uniqueness.
   > For the precombine key, you need to choose a field that helps determine
   > which is the latest record among two records with the same record key.
   > For the partition path, you need to choose how to group your data. Here it
   > could be just the date, or a combination of date and store, and more. This
   > determines how your table data is partitioned. If you are interested in sales
   > on a daily basis, maybe a date-based partition would be good.
   > 
   > Please let me know if you have more questions.
   
   I made it work as follows:
   ```
   tableName = "forecast_evals"
   basePath = "gs://hudi-datalake/" + tableName
   
   hudi_options = {
     'hoodie.table.name': tableName,
     'hoodie.datasource.write.recordkey.field': 'key',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'training_date'
   }
   
   results = results.selectExpr(
                       "CONCAT('Store=',  store, ' Item=', item) as key",
                       "store",
                       "item",
                       "mae",
                       "mse",
                       "rmse",
                       "training_date")
   
   results.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   ```
   
   However, it runs very slowly!
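
One thing that might be worth trying, as a rough sketch only: set an explicit partition path field (as suggested in the quoted reply) and lower the insert shuffle parallelism, which is a commonly tuned Hudi write setting. The helper names `build_hudi_options` and `write_to_hudi` are hypothetical, using `training_date` as the partition field is an assumption about the data, and the parallelism value of 2 is an arbitrary placeholder. Whether any of this removes the slowness here is untested.

```python
def build_hudi_options(table_name, record_key, precombine_field,
                       partition_field, parallelism=2):
    """Build Hudi write options as a plain dict (hypothetical helper).

    Assumptions: the partition field and the parallelism value are
    illustrative guesses, not settings taken from the original comment.
    """
    return {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': 'insert',
        'hoodie.datasource.write.recordkey.field': record_key,
        'hoodie.datasource.write.precombine.field': precombine_field,
        # Groups the table data by this column, e.g. one partition per date.
        'hoodie.datasource.write.partitionpath.field': partition_field,
        # Number of shuffle partitions used for inserts; passed as a string
        # like every other option value.
        'hoodie.insert.shuffle.parallelism': str(parallelism),
    }


def write_to_hudi(df, base_path, options):
    """Write a Spark DataFrame to Hudi with the given options.

    `df` is expected to be a pyspark.sql.DataFrame; it is not created here.
    """
    df.write.format("hudi") \
        .options(**options) \
        .mode("overwrite") \
        .save(base_path)


tableName = "forecast_evals"
basePath = "gs://hudi-datalake/" + tableName

hudi_options = build_hudi_options(
    table_name=tableName,
    record_key="key",
    precombine_field="training_date",
    partition_field="training_date",  # assumption: partition by date
)

# With a live Spark session and the `results` DataFrame from above:
# write_to_hudi(results, basePath, hudi_options)
```

Keeping the options in a small function makes it easy to try different parallelism values or partition fields and compare write times.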

