HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-464613135

@tdas I may need to get some numbers to back up my idea, but let me explain the rationale first.

The lesson I learned from my previous work in #21733 is that reducing the size of the per-batch state diff performs better (in both size and time), even though it requires an additional projection.

I considered two approaches: 1) add a boolean flag to the current index-to-row state, or 2) add a new state store that stores only the boolean flag. If we compare the two approaches by state size, we can expect the following:

* approach 1) changes the state by `value + boolean flag` (the key doesn't need to be stored again)
* approach 2) changes the state by `key + boolean flag` (the value doesn't need to be stored again)

Given that we store the row as-is in the value part, `value + boolean flag` will be bigger than `key + boolean flag` most of the time (since the value may also contain part or all of the key). This makes me think we can take approach 2) to optimize state size, at the cost of some added complexity in the state codebase. (Having one more state store brings its own non-trivial overhead, so I would not say approach 2) is 100% superior to approach 1). That may need to be explored if we would like to see the numbers.)

If we take approach 2), the codebase has to be refactored to avoid huge code duplication: the current implementation doesn't seem to be extensible, so the change refactors the code to reduce duplication as far as possible, letting the same code be used in both places.

I would be happy to take approach 1) in this PR and experiment with approach 2) later if we have doubts about its benefit. Please let me know which one we would prefer.

Thanks in advance!
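
To make the size comparison above concrete, here is a rough back-of-the-envelope model. It is only a sketch, not Spark code: the function names, the 1-byte flag, and the key/row sizes are all made-up assumptions, and it follows the accounting in the comment (approach 1 rewrites `value + flag`, approach 2 writes `key + flag`) whenever a row's flag flips to "matched".

```python
# Back-of-the-envelope model only; this is not Spark code.
# All names and byte sizes below are hypothetical.

FLAG_BYTES = 1  # assumed size of the serialized "matched" boolean


def diff_bytes_flag_in_value(value_bytes: int) -> int:
    """Approach 1: the flag sits next to the row, so flipping it rewrites value + flag."""
    return value_bytes + FLAG_BYTES


def diff_bytes_separate_store(key_bytes: int) -> int:
    """Approach 2: a separate store maps key -> flag, so flipping it writes key + flag."""
    return key_bytes + FLAG_BYTES


def batch_diff(rows, newly_matched):
    """Sum the per-batch state diff for both approaches.

    rows maps a row id to (key_bytes, value_bytes); newly_matched lists the
    row ids whose flag flips from False to True in this batch.
    """
    approach1 = sum(diff_bytes_flag_in_value(rows[r][1]) for r in newly_matched)
    approach2 = sum(diff_bytes_separate_store(rows[r][0]) for r in newly_matched)
    return approach1, approach2


# Made-up sizes: a 16-byte join key and a 200-byte row that embeds the key.
rows = {i: (16, 200) for i in range(1000)}
a1, a2 = batch_diff(rows, newly_matched=range(100))
print(f"approach 1 diff: {a1} bytes, approach 2 diff: {a2} bytes")
```

The model deliberately ignores the fixed overhead of maintaining an extra state store, which is exactly the part that would need real measurements.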
