structured streaming join of streaming dataframe with static dataframe performance

Koert Kuipers Sun, 17 Jul 2022 17:38:24 -0700

i was surprised to find out that when a streaming dataframe is joined with a
static dataframe, the static dataframe is re-shuffled for every
microbatch, which adds considerable overhead.


wouldn't it make more sense to re-use the shuffle files?

or, if that is not possible, load the static dataframe into the
statestore? this would turn the join into a lookup (in rocksdb).
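
to make the difference concrete, here is a minimal sketch in plain python. it is
not spark code and not how spark implements joins: hash_partition stands in for a
shuffle, a dict stands in for a keyed store such as rocksdb, and the function
names are hypothetical. it only counts how many static rows get processed under
each strategy.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Group (key, value) rows by key hash; a toy stand-in for a shuffle."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_with_reshuffle(microbatches, static_rows, num_partitions=4):
    """Model of the observed behavior: the static side is re-partitioned per microbatch."""
    static_rows_shuffled = 0
    output = []
    for batch in microbatches:
        # the static side is partitioned again for every microbatch
        static_parts = hash_partition(static_rows, num_partitions)
        static_rows_shuffled += len(static_rows)
        stream_parts = hash_partition(batch, num_partitions)
        for pid, stream_part in stream_parts.items():
            # per-partition hash join between co-partitioned rows
            lookup = defaultdict(list)
            for key, static_value in static_parts.get(pid, []):
                lookup[key].append(static_value)
            for key, stream_value in stream_part:
                for static_value in lookup.get(key, []):
                    output.append((key, stream_value, static_value))
    return output, static_rows_shuffled


def join_with_lookup(microbatches, static_rows):
    """Model of the proposal: index the static side once, then look up per streaming row."""
    index = defaultdict(list)  # stand-in for a keyed store
    for key, static_value in static_rows:
        index[key].append(static_value)
    static_rows_indexed = len(static_rows)
    output = []
    for batch in microbatches:
        for key, stream_value in batch:
            for static_value in index.get(key, []):
                output.append((key, stream_value, static_value))
    return output, static_rows_indexed
```

in this toy model both functions produce the same joined rows (an inner join),
but the first touches the static rows once per microbatch while the second
touches them once in total.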

