Ted Yu suggested posting the improved numbers to this thread, and I think
it's a good idea, so I'm posting them here as well. I also think explaining
the rationale behind my issues would help in understanding why I'm submitting
a couple of patches, so I'll explain that first. (Sorry to post a wall of text.)

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of the HDFS state
store provider from 10x~80x the size of the state for a batch, depending on
the stateful workload, to around 2x or less. The new option is flexible, so
usage can even be brought down to around 1x, or the cache can be effectively
disabled. Please refer to the comment in the PR [2] for more details. (It's
hard to post detailed numbers in mail format, so I'm linking a GitHub comment
instead.)

I'm interested in stateful stream processing on Structured Streaming,
and have been learning from the codebase as well as analyzing memory usage
and latency (though I admit it is hard to measure latency correctly...).

https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html

While taking a look at HDFSBackedStateStoreProvider, I noticed a kind of
excessive caching. As I described in the section "The impact of storing
multiple versions from HDFSBackedStateStoreProvider" in the link above,
multiple versions share the same UnsafeRow unless the value changes, which
lessens the impact of caching multiple versions (credit to Jose Torres,
since I realized it from his comment). But in workloads where lots of
writes to state occur in a batch, the overall memory usage of state can go
well beyond expectation.
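
To make the sharing effect concrete, here is a minimal Python sketch. It is
only an illustrative model, not Spark's actual code: the helpers commit,
distinct_rows and make_rows are hypothetical, and plain bytearray objects
stand in for UnsafeRow. Each version is a shallow copy of the previous map,
so unchanged rows are shared and only rewritten rows add new objects.

```python
# Illustrative model only; this is NOT Spark's HDFSBackedStateStoreProvider.
# bytearray objects stand in for UnsafeRow values (assumption for the sketch).

def commit(prev_version, updates):
    """Build the next version: copy the references, then apply the updates."""
    new_version = dict(prev_version)  # shallow copy: row objects are shared
    new_version.update(updates)       # only updated keys point to new rows
    return new_version

def distinct_rows(versions):
    """Count distinct row objects held alive across all cached versions."""
    return len({id(row) for version in versions for row in version.values()})

def make_rows(tag, n):
    """Create n fresh row objects keyed 0..n-1."""
    return {key: bytearray(f"{tag}-{key}", "utf-8") for key in range(n)}

base = make_rows("v1", 1000)

# Light-write workload: 10 of 1000 keys are rewritten in each batch.
light = [base]
for batch in range(2, 5):
    light.append(commit(light[-1], make_rows(f"v{batch}", 10)))

# Heavy-write workload: every key is rewritten in each batch.
heavy = [base]
for batch in range(2, 5):
    heavy.append(commit(heavy[-1], make_rows(f"v{batch}", 1000)))

light_rows = distinct_rows(light)  # 1000 + 3 * 10 distinct row objects
heavy_rows = distinct_rows(heavy)  # 4 * 1000 distinct row objects
print(light_rows, heavy_rows)
```

With four cached versions, the light-write case holds barely more rows than
a single version, while the heavy-write case holds four full copies, which
is the kind of growth described above.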

A related patch [3] has also been submitted by another contributor (so I'm
not the only one to notice this behavior), though that patch might not look
general enough to apply.

First, I decided to track the overall memory size of the state provider cache
and expose it to the UI as well as the query status (SPARK-24441 [4]). The
metric looked critical and worth monitoring, so I thought it better to
expose it (and the watermark) to Dropwizard as well (SPARK-24637 [5]).

Building on SPARK-24441, I was able to find a more flexible way to resolve
the issue (SPARK-24717), which is what I mentioned in the tl;dr.
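
As a rough sketch of the idea of bounding how many versions stay cached in
memory, consider the Python model below. The class VersionCache and its
max_versions parameter are hypothetical names used only for illustration;
the actual option and its semantics are in the PR [2].

```python
from collections import OrderedDict

class VersionCache:
    """Hypothetical model: keep at most `max_versions` state versions in memory."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = OrderedDict()  # version number -> state map

    def put(self, version, state):
        if self.max_versions <= 0:
            return  # a non-positive limit effectively disables the cache
        self._versions[version] = state
        self._versions.move_to_end(version)
        while len(self._versions) > self.max_versions:
            self._versions.popitem(last=False)  # evict the earliest inserted

    def get(self, version):
        # None means a cache miss; in this model the caller would then have
        # to rebuild the version from its checkpointed files.
        return self._versions.get(version)

    def cached_versions(self):
        return list(self._versions)

cache = VersionCache(max_versions=2)
for v in range(1, 6):
    cache.put(v, {"version": v})
cached = cache.cached_versions()  # only the two latest versions remain

disabled = VersionCache(max_versions=0)
disabled.put(1, {"version": 1})
print(cached, disabled.cached_versions())
```

In this model, a limit of 2 keeps roughly the latest version plus the one
before it, a limit of 1 keeps only the latest, and 0 trades memory for
reloading on every access.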

So 3 of the 5 issues are coupled so far, in order to track and resolve one
issue. I hope this helps explain why these patches are worth reviewing.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24717
2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
3. https://github.com/apache/spark/pull/21500
4. https://issues.apache.org/jira/browse/SPARK-24441
5. https://issues.apache.org/jira/browse/SPARK-24637

ps. Before all the issues mentioned above, I also submitted some other issues
regarding feature addition/refactoring (2 of the 5 issues).


On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim <kabh...@gmail.com> wrote:

> Bump. I have been having a hard time making additional PRs since
> some of them rely on non-merged PRs, so I'm spending additional time
> decoupling them where possible.
>
> https://github.com/apache/spark/pulls/HeartSaVioR
>
> 5 PRs are pending so far, and I may add more sooner or later.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Kind reminder, since around 2 weeks have passed. I've added more PRs
>> during those 2 weeks and am even planning to do more.
>>
>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Hi Spark devs,
>>>
>>> I have a couple of pull requests for structured streaming which are
>>> getting older and fading from the earlier pages of the PR list.
>>>
>>> https://github.com/apache/spark/pull/21469
>>> https://github.com/apache/spark/pull/21357
>>> https://github.com/apache/spark/pull/21222
>>>
>>> Two of them have a kind of approval from a couple of folks, but no
>>> approval from committers yet.
>>> One of them needs a rebase, and I would be happy to do it after review
>>> or while review is in progress.
>>>
>>> To be honest, getting reviewed in time is critical for contributors,
>>> so I'd like to ask the dev mailing list to review my PRs.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>
