I recently added more test results to SPARK-24763 [1], which show that the proposal reduces state size according to the ratio of key size to value size, with no performance hit and sometimes even a slight boost.
Please refer to the latest comment in the JIRA issue [2] to see the numbers from the perf. tests.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763
2. https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367

On Mon, Jul 9, 2018 at 5:28 PM, Jungtaek Lim <kabh...@gmail.com> wrote:

> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
> option to enable optimization of state size in streaming aggregation
> without hurting performance.
>
> The idea is to remove the key fields from the value, since they are
> duplicated between key and value in a state row. This requires additional
> operations such as projection and join, but a smaller state row also gives
> a performance benefit, so the two can offset each other.
>
> Please refer to the comment in the JIRA issue [2] to see the numbers from
> a simple perf. test.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
>
>
> On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Ted Yu suggested posting the improved numbers to this thread and I think
>> it's a good idea, so I'm also posting them here. I also think explaining
>> the rationale behind my issues would help in understanding why I'm
>> submitting a couple of patches, so I'll explain that first. (Sorry to
>> post a wall of text.)
>>
>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of the HDFS
>> state store provider from 10x~80x the size of the state for a batch,
>> depending on the stateful workload, to less than or around 2x. The new
>> option is flexible, so it can even be around 1x, or effectively disable
>> the cache. Please refer to the comment in the PR [2] for more details.
>> (It is hard to post detailed numbers in mail format, so I link a GitHub
>> comment instead.)
>>
>> I am interested in stateful stream processing on Structured Streaming,
>> and have been learning from the codebase as well as analyzing memory
>> usage and latency (though I admit it is hard to measure latency
>> correctly...).
>>
>>
>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>
>> While looking at HDFSBackedStateStoreProvider I noticed a kind of
>> excessive caching. As I described in the section "The impact of storing
>> multiple versions from HDFSBackedStateStoreProvider" in the above link,
>> multiple versions share the same UnsafeRow unless there's a change to the
>> value, which lessens the impact of caching multiple versions (credit to
>> Jose Torres, since I realized it from his comment). But in workloads
>> where lots of writes to state occur in a batch, the overall memory usage
>> of state goes well beyond expectation.
>>
>> A related patch [3] was also submitted by another contributor (so I'm not
>> the only one to notice this behavior), though that patch might not look
>> generalized enough to apply.
>>
>> First I decided to track the overall memory size of the state provider
>> cache and expose it to the UI as well as the query status (SPARK-24441
>> [4]). The metric looked critical and worth monitoring, so I thought it
>> better to expose it (and the watermark) to Dropwizard (SPARK-24637 [5]).
>>
>> Based on the adoption of SPARK-24441, I could find a more flexible way to
>> resolve the issue (SPARK-24717), which I mentioned in the tl;dr.
>>
>> So 3 of the 5 issues so far are coupled, to track and resolve one issue.
>> I hope this helps explain why these patches are worth reviewing.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>> 3. https://github.com/apache/spark/pull/21500
>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>
>> P.S. Before all the issues mentioned above, I also submitted some other
>> issues regarding feature addition/refactoring (2 of the 5 issues).
>>
>>
>> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Bump. I have been having a hard time making additional PRs since some
>>> of them rely on non-merged PRs, so I'm spending additional time
>>> decoupling these things where possible.
>>>
>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>> 5 PRs are pending so far, and I may add more sooner or later.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>>> Kind reminder, since around 2 weeks have passed. I've added more PRs
>>>> during these 2 weeks and am even planning to do more.
>>>>
>>>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I have a couple of pull requests for Structured Streaming which are
>>>>> getting older and fading from the earlier pages of the PR list.
>>>>>
>>>>> https://github.com/apache/spark/pull/21469
>>>>> https://github.com/apache/spark/pull/21357
>>>>> https://github.com/apache/spark/pull/21222
>>>>>
>>>>> Two of them have a kind of approval from a couple of folks, but no
>>>>> approval from committers yet.
>>>>> One of them needs a rebase, and I would be happy to do it after
>>>>> review or while review is in progress.
>>>>>
>>>>> To be honest, getting reviewed in time is critical for contributors,
>>>>> so I'd like to ask the dev mailing list to review my PRs.
>>>>>
>>>>> Thanks in advance,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>
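
For readers skimming the thread, here is a rough Python sketch of the idea described for SPARK-24763: keep the key fields only in the key, store the remaining fields as the value, and join the two back together on read. This is only an illustration under stated assumptions, not the code in the Spark patch. The class `CompactStateStore`, the helpers `strip_key_fields` and `restore_row`, and the assumption that key fields are the leading fields of a row are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for a state row: a plain tuple of fields.
Row = Tuple


def strip_key_fields(full_row: Row, key_len: int) -> Tuple[Row, Row]:
    """Projection step: split a full row into (key, value without key fields).

    Assumes (for illustration only) that the key fields are the first
    `key_len` fields of the row.
    """
    return full_row[:key_len], full_row[key_len:]


def restore_row(key: Row, value: Row) -> Row:
    """Join step: rebuild the full row from the key and the stripped value."""
    return key + value


class CompactStateStore:
    """Toy in-memory store that never duplicates key fields in the value."""

    def __init__(self, key_len: int) -> None:
        self.key_len = key_len
        self._rows: Dict[Row, Row] = {}

    def put(self, full_row: Row) -> None:
        key, value = strip_key_fields(full_row, self.key_len)
        self._rows[key] = value

    def get(self, key: Row) -> Optional[Row]:
        value = self._rows.get(key)
        return None if value is None else restore_row(key, value)

    def stored_field_count(self) -> int:
        """Number of fields held in the store (keys plus stripped values)."""
        return sum(len(k) + len(v) for k, v in self._rows.items())


store = CompactStateStore(key_len=2)
store.put(("user1", "pageA", 10, 3.5))
full = store.get(("user1", "pageA"))
```

In this toy case the store holds 4 fields for the row, where keeping the whole row as the value would hold 6 (2 key fields plus 4 value fields), so the saving grows with the share of the row taken up by key fields. The extra split on write and join on read correspond to the additional projection and join operations mentioned in the thread.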