I recently added more test results to SPARK-24763 [1], which show that the
proposal reduces state size by an amount that depends on the ratio of key size
to value size, with no performance hit and sometimes even a slight boost.
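
To make the idea concrete, here is a minimal sketch of dropping the key
fields from the stored value and joining them back on read. It is not the
code from the patch: plain Python tuples stand in for Spark's UnsafeRow, and
the field names and the put/get helpers are hypothetical, chosen only for
illustration.

```python
# Illustrative sketch only: tuples stand in for UnsafeRow, and every name
# below (fields, helpers) is an assumption, not taken from the Spark patch.

KEY_FIELDS = ("user", "window_start")                   # grouping columns (assumed)
ALL_FIELDS = ("user", "window_start", "count", "sum")   # full aggregation row (assumed)

# Positions of the key fields and of the non-key fields within the full row.
KEY_IDX = [ALL_FIELDS.index(f) for f in KEY_FIELDS]
VALUE_ONLY_IDX = [i for i, f in enumerate(ALL_FIELDS) if f not in KEY_FIELDS]


def project_value(full_row):
    """Projection: drop the key fields before storing the row as the value."""
    return tuple(full_row[i] for i in VALUE_ONLY_IDX)


def join_row(key, value):
    """Join: rebuild the full row from the stored key and the trimmed value."""
    merged = dict(zip(KEY_FIELDS, key))
    merged.update(zip((ALL_FIELDS[i] for i in VALUE_ONLY_IDX), value))
    return tuple(merged[f] for f in ALL_FIELDS)


def put(state, full_row):
    """Store a full row, keeping the key fields only in the key."""
    key = tuple(full_row[i] for i in KEY_IDX)
    state[key] = project_value(full_row)


def get(state, key):
    """Read a row back in its original, full layout."""
    return join_row(key, state[key])
```

The more of the row the key fields take up, the more is saved per state row,
which is why the reduction follows the key-to-value size ratio.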

Please refer to the latest comment on the JIRA issue [2] for the numbers from
the perf tests.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763
2. https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367


On Mon, Jul 9, 2018 at 5:28 PM, Jungtaek Lim <kabh...@gmail.com> wrote:

> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
> option to enable optimization of state size in streaming aggregation
> without hurting performance.
>
> The idea is to remove the key fields from the value, since they are
> duplicated between the key and the value in a state row. This requires
> additional operations such as projection and join, but a smaller state row
> also gives a performance benefit, so the two can offset each other.
>
> Please refer to the comment on the JIRA issue [2] for the numbers from a
> simple perf test.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
>
>
> On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Ted Yu suggested posting the improved numbers to this thread, and I think
>> it's a good idea, so I'm posting them here as well. I also think explaining
>> the rationale behind my issues would help in understanding why I'm
>> submitting a couple of patches, so I'll explain that first. (Sorry for the
>> wall of text.)
>>
>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of the HDFS state
>> store provider from 10x~80x the size of the state for a batch, depending on
>> the stateful workload, to around 2x or less. The new option is flexible, so
>> usage can even be brought down to around 1x, or the cache can be effectively
>> disabled. Please refer to the comment in the PR [2] for more details. (It is
>> hard to post detailed numbers in mail format, so I'm linking a GitHub
>> comment instead.)
>>
>> I'm interested in stateful stream processing on Structured Streaming, and
>> have been learning from the codebase as well as analyzing memory usage and
>> latency (though I admit it is hard to measure latency correctly...).
>>
>>
>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>
>> While looking at HDFSBackedStateStoreProvider, I noticed a kind of
>> excessive caching. As I described in the section "The impact of storing
>> multiple versions from HDFSBackedStateStoreProvider" in the link above,
>> multiple versions share the same UnsafeRow unless the value changes, which
>> lessens the impact of caching multiple versions (credit to Jose Torres,
>> since I realized this from his comment). But in workloads where lots of
>> writes to state occur in a batch, the overall memory usage of state grows
>> beyond expectation.
>>
>> A related patch [3] has also been submitted by another contributor (so I'm
>> not the only one to notice this behavior), though that patch might not look
>> general enough to apply.
>>
>> First, I decided to track the overall memory size of the state provider
>> cache and expose it to the UI as well as the query status (SPARK-24441 [4]).
>> The metric looked critical and worth monitoring, so I thought it better to
>> expose it (and the watermark) to Dropwizard (SPARK-24637 [5]).
>>
>> Building on SPARK-24441, I could find a more flexible way to resolve the
>> issue (SPARK-24717), which is what I mentioned in the tl;dr.
>>
>> So 3 of the 5 issues are coupled so far, to track and resolve one problem.
>> I hope this helps explain why these patches are worth reviewing.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>> 3. https://github.com/apache/spark/pull/21500
>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>
>> P.S. Before all the issues mentioned above, I also submitted some other
>> issues regarding feature additions and refactoring (2 of the 5 issues).
>>
>>
>> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Bump. I have been having a hard time making additional PRs since some of
>>> them rely on unmerged PRs, so I'm spending additional time decoupling
>>> them where possible.
>>>
>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>> 5 PRs are pending so far, and I may add more sooner or later.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>>> A kind reminder, since around 2 weeks have passed. I've added more PRs
>>>> during those 2 weeks and am even planning to do more.
>>>>
>>>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I have a couple of pull requests for Structured Streaming which are
>>>>> getting older and dropping off the earlier pages of the PR list.
>>>>>
>>>>> https://github.com/apache/spark/pull/21469
>>>>> https://github.com/apache/spark/pull/21357
>>>>> https://github.com/apache/spark/pull/21222
>>>>>
>>>>> Two of them have a kind of approval from a couple of folks, but no
>>>>> approval from committers yet.
>>>>> One of them needs a rebase, and I would be happy to do it after review
>>>>> or while review is in progress.
>>>>>
>>>>> To be honest, getting reviewed in time is critical for contributors,
>>>>> so I'd like to ask the dev mailing list to review my PRs.
>>>>>
>>>>> Thanks in advance,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>
