[GitHub] [hudi] rubenssoto opened a new issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

GitBox Thu, 16 Jul 2020 09:40:25 -0700


rubenssoto opened a new issue #1839:
URL: https://github.com/apache/hudi/issues/1839



   Hi guys, how are you?
   
   I have some use cases where I want to read from a Hudi dataset using 
structured streaming and write to another, grouped Hudi dataset. As a real-world 
example, I have a raw zone in my data lake and want to stream from the raw zone 
to the curated zone, but sometimes my curated Hudi dataset is grouped.
   
   Spark streaming doesn't work with Hudi datasets as sources, so for this use 
case to work I need to treat the Hudi dataset like a normal parquet dataset. But 
Hudi rewrites data every time it writes, and the new file has the old data plus 
the new data. If my sink isn't grouped, it's only a deduplication problem, but 
my sink is grouped, so it isn't going to work.
   
   I have no guarantee that all the data for a group is in the new file that 
Hudi writes.
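
   To make the problem concrete, here is a minimal pure-Python sketch. It is 
not real Hudi or Spark code: the file contents, the row layout 
(record_key, group, amount) and all function names are hypothetical, and it 
assumes a plain file source that emits every newly written file as one 
micro-batch.

```python
from collections import defaultdict

# Hypothetical model: each "file" is a list of (record_key, group, amount) rows.
file_a_v1 = [("k1", "g1", 10), ("k2", "g2", 5)]
file_b_v1 = [("k3", "g1", 7)]
# A later commit adds k4; the rewritten file repeats k1 and k2 (old data plus new data).
file_a_v2 = file_a_v1 + [("k4", "g1", 1)]

# Assumption: a plain parquet file source emits each new file as a micro-batch.
micro_batches = [file_a_v1, file_b_v1, file_a_v2]


def dedup_by_key(batches):
    """Ungrouped sink: upserting by record key absorbs the repeated rows."""
    latest = {}
    for batch in batches:
        for key, group, amount in batch:
            latest[key] = (group, amount)
    return latest


def running_group_sum(batches):
    """Grouped sink that accumulates across batches: repeated rows are counted twice."""
    totals = defaultdict(int)
    for batch in batches:
        for _key, group, amount in batch:
            totals[group] += amount
    return dict(totals)


def group_sum_from_batch(batch):
    """Grouped sink that recomputes from one file only: rows in other files are missed."""
    totals = defaultdict(int)
    for _key, group, amount in batch:
        totals[group] += amount
    return dict(totals)


# Reference result: group totals computed over the deduplicated records.
correct = defaultdict(int)
for group, amount in dedup_by_key(micro_batches).values():
    correct[group] += amount
correct = dict(correct)

accumulated = running_group_sum(micro_batches)
from_new_file_only = group_sum_from_batch(file_a_v2)
```

   In this toy model, accumulating over batches overcounts both groups, and 
recomputing from the rewritten file alone misses k3, which lives in a 
different file.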
   
   I use PySpark to write my streaming jobs because it's easier for my team, so 
I think DeltaStreamer is not an option.
   
   Do you have any idea how to solve this? And do you have plans to support 
Hudi datasets as a Spark streaming source?
   
   Delta Lake has solved this problem with the ignoreChanges option:
   https://docs.databricks.com/delta/delta-streaming.html


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
