HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming 
left/right outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-464613135
 
 
   @tdas 
   I may need to get some numbers to back up my idea, but let me explain the 
rationale first.
   
   The lesson I learned from my previous work in #21733 is that reducing the 
size of the state diff per batch performs better (in both size and time), even 
though it needs an additional projection. I considered two approaches: 1) add a 
boolean flag to the current index-to-row state, or 2) add a new state store 
that only stores the boolean flag. Comparing the two by state size, we can 
expect the following:
   
   * approach 1) changes the state by `value + boolean flag` per update (the 
key doesn't need to be stored again)
   * approach 2) changes the state by `key + boolean flag` per update (the 
value doesn't need to be stored again)
   
   Given that we store the row as-is for the value part, most of the time 
`value + boolean flag` would be bigger than `key + boolean flag` (since the 
value may also contain part or all of the key). That makes me think we can take 
approach 2) to optimize state size, at the cost of some added complexity in the 
state codebase.
   (Having one more state store has its own non-trivial overhead, so I would 
not say approach 2) is 100% superior to approach 1). That may need to be 
explored if we would like to see the numbers.)
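
   To make the size comparison above concrete, here is a rough sketch. It is 
not taken from the PR: the function names and every byte count are hypothetical 
assumptions, and it deliberately ignores the per-store overhead mentioned in 
the parenthetical above, so it illustrates the reasoning rather than measuring 
anything.

```python
FLAG_BYTES = 1  # assumption: the matched flag serializes to a single byte


def delta_bytes_approach_1(value_bytes: int) -> int:
    """Approach 1: flag stored alongside the row, so an update rewrites value + flag."""
    return value_bytes + FLAG_BYTES


def delta_bytes_approach_2(key_bytes: int) -> int:
    """Approach 2: flag kept in its own store, so an update writes key + flag."""
    return key_bytes + FLAG_BYTES


def smaller_delta(key_bytes: int, value_bytes: int) -> int:
    """Return 1 or 2 for the approach with the smaller per-update delta (1 on a tie)."""
    if delta_bytes_approach_2(key_bytes) < delta_bytes_approach_1(value_bytes):
        return 2
    return 1


if __name__ == "__main__":
    # Hypothetical sizes: a wide row keyed by a narrow join key.
    key_bytes, value_bytes = 16, 120
    print(delta_bytes_approach_1(value_bytes))    # 121
    print(delta_bytes_approach_2(key_bytes))      # 17
    print(smaller_delta(key_bytes, value_bytes))  # 2
```

   With these made-up sizes the per-update delta for approach 2) is much 
smaller, but the picture flips whenever the key is wider than the value, and 
real numbers would still be needed to account for the extra store.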
   
   Suppose we take approach 2): refactoring the codebase is necessary to avoid 
huge code duplication. The current implementation doesn't seem to be 
extensible, so the change refactors the code to reduce duplication as much as 
possible, so that the same code can be used in both places.
   
   I would be happy to take approach 1) in this PR and experiment with 
approach 2) later if we have doubts about its benefit. Please let me know which 
one we would prefer. Thanks in advance!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
