rubenssoto opened a new issue #1839: URL: https://github.com/apache/hudi/issues/1839
Hi guys, how are you?

I have some use cases where I want to read from a Hudi dataset with Structured Streaming and write to another, grouped Hudi dataset. As a real-world example, I have a raw zone in my data lake and want to stream from the raw zone to the curated zone, but sometimes my curated Hudi dataset is grouped.

Spark Structured Streaming doesn't work with Hudi datasets as a source, so for this use case I need to treat the Hudi dataset like a normal Parquet dataset. But Hudi rewrites data every time it writes, and the new file contains the old data plus the new data. If my sink isn't grouped, that is only a deduplication problem, but my sink is grouped, so it isn't going to work: I have no guarantee that all the data for a group is in the new file that Hudi writes.

I use PySpark to write my streaming jobs because it's easier for my team, so I think DeltaStreamer is not an option.

Do you have any idea how to solve this? And do you have plans to support Hudi datasets as a Spark streaming source?

Delta Lake has solved this problem with the ignoreChanges option: https://docs.databricks.com/delta/delta-streaming.html
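To make the problem concrete, here is a minimal pure-Python sketch. It is not Hudi or Spark code: the record layout, the file names, and the helper functions are all hypothetical, and each list stands in for one Parquet file that a file-based stream would pick up as a new micro-batch. It only illustrates why a rewritten file is harmless for an ungrouped sink but breaks a grouped one.

```python
from collections import defaultdict

# Hypothetical records: (record_key, group_key, amount).
file_a_v1 = [("r1", "A", 10), ("r2", "B", 5)]
file_b_v1 = [("r4", "A", 3)]  # group A also has a row in a different file

# Simulated rewrite: the new version of file A holds the old rows plus one new row.
file_a_v2 = file_a_v1 + [("r3", "A", 7)]

# A file-based stream sees every new file as a new micro-batch.
batches = [file_a_v1, file_b_v1, file_a_v2]


def dedupe(rows):
    """Keep one row per record key (last one wins)."""
    latest = {}
    for key, group, amount in rows:
        latest[key] = (key, group, amount)
    return list(latest.values())


def group_sum(rows):
    """Sum amounts per group key."""
    totals = defaultdict(int)
    for _, group, amount in rows:
        totals[group] += amount
    return dict(totals)


# Ungrouped sink: upserting by record key removes the repeated rows.
ungrouped_sink = dedupe([row for batch in batches for row in batch])

# Grouped sink, attempt 1: add each batch's sums to the stored totals.
# r1 and r2 are counted twice because they reappear in file_a_v2.
additive_totals = defaultdict(int)
for batch in batches:
    for group, total in group_sum(batch).items():
        additive_totals[group] += total
additive_totals = dict(additive_totals)

# Grouped sink, attempt 2: overwrite each group's total with the batch's sum.
# Group A loses r4, which lives in a file that was not rewritten.
overwrite_totals = {}
for batch in batches:
    overwrite_totals.update(group_sum(batch))

# What the grouped sink should contain.
expected_totals = group_sum(ungrouped_sink)

print("ungrouped rows:", len(ungrouped_sink))
print("additive:", additive_totals)
print("overwrite:", overwrite_totals)
print("expected:", expected_totals)
```

Neither grouped attempt matches the expected totals: adding per-batch sums double counts the rewritten rows, and overwriting per-batch sums misses rows of the same group stored in other files.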
