Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim Sun, 15 Jul 2018 21:04:59 -0700

Bump. I got a couple of review comments from contributors, including a soft
LGTM, but I still haven't received any (non-code-style) review from
committers, so technically there has been no progress toward merging.


I'm planning to work on adding a new feature as well, but it's not easy for
me to concentrate on that while also maintaining 6 existing pull requests.
Merge conflicts become a problem for maintenance, especially as other pull
requests (submitted later than mine) are getting reviewed and merged.

I'd like to ask any committer familiar with structured streaming to take a
look at these pull requests.

- Jungtaek Lim (HeartSaVioR)

On Thu, Jul 12, 2018 at 10:41 PM, Jungtaek Lim <[email protected]> wrote:

> I recently added more test results to SPARK-24763 [1], which show that the
> proposal reduces state size according to the key-to-value size ratio, with
> no performance hit and sometimes even a slight boost.
>
> Please refer to the latest comment in the JIRA issue [2] to see the numbers
> from the perf. tests.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
> 2.
> https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367
>
>
> On Mon, Jul 9, 2018 at 5:28 PM, Jungtaek Lim <[email protected]> wrote:
>
>> I'm now adding one more issue (SPARK-24763 [1]), which proposes a new
>> option to enable optimization of state size in streaming aggregation
>> without hurting performance.
>>
>> The idea is to remove the key fields from the value, since they are
>> duplicated between key and value in a state row. This requires additional
>> operations such as projection and join, but a smaller state row also gives
>> a performance benefit, so the two can offset each other.
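
The idea above can be illustrated with a minimal sketch in plain Python (not Spark code). The field names `user`, `window`, `count`, and `sum`, and the helpers `split_row` and `restore_row`, are hypothetical and chosen only for illustration: the key fields are projected out of the value before it is stored, and joined back when the row is read.

```python
# Hypothetical grouping columns; in a real aggregation these would be the group-by keys.
KEY_FIELDS = ("user", "window")


def split_row(row, key_fields=KEY_FIELDS):
    """Project a full row into (key, value), dropping key fields from the value."""
    key = tuple(row[f] for f in key_fields)
    value = {f: v for f, v in row.items() if f not in key_fields}
    return key, value


def restore_row(key, value, key_fields=KEY_FIELDS):
    """Join key and value back into the full row when reading from state."""
    row = dict(zip(key_fields, key))
    row.update(value)
    return row


row = {"user": "a", "window": 10, "count": 3, "sum": 7.5}
key, value = split_row(row)      # value no longer carries "user" or "window"
restored = restore_row(key, value)
```

The stored value shrinks by the size of the key fields, at the cost of one projection on write and one join on read.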
>>
>> Please refer to the comment in the JIRA issue [2] to see the numbers from
>> a simple perf. test.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>>
>>
>> On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim <[email protected]> wrote:
>>
>>> Ted Yu suggested posting the improved numbers to this thread, and I think
>>> it's a good idea, so I'm posting them here too. I also think explaining
>>> the rationale behind my issues would help in understanding why I'm
>>> submitting a couple of patches, so I'll explain that first. (Sorry for
>>> the wall of text.)
>>>
>>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of the HDFS
>>> state store provider from 10x~80x the size of the state for a batch,
>>> depending on the stateful workload, to less than or around 2x. The new
>>> option is flexible, so usage can even be around 1x, or the cache can be
>>> effectively disabled. Please refer to the comment in the PR [2] for more
>>> details. (It's hard to post detailed numbers in mail format, so I'm
>>> linking a GitHub comment instead.)
>>>
>>> I'm interested in stateful stream processing on Structured Streaming,
>>> and have been learning from the codebase as well as analyzing memory
>>> usage and latency (though I admit it is hard to measure latency
>>> correctly...).
>>>
>>>
>>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>>
>>> While looking at HDFSBackedStateStoreProvider, I noticed a kind of
>>> excessive caching. As I described in the section "The impact of storing
>>> multiple versions from HDFSBackedStateStoreProvider" in the link above,
>>> multiple versions share the same UnsafeRow unless the value changes,
>>> which lessens the impact of caching multiple versions (credit to Jose
>>> Torres, since I realized it from his comment). But in workloads where
>>> lots of writes to state occur in a batch, the overall memory usage of
>>> state goes beyond expectations.
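
A toy model may make the sharing effect concrete. The sketch below is plain Python, not Spark code: each cached version is a dict from key to row object, a new version shallow-copies the previous one, and only the written keys get new row objects. The function names and the numbers are made up for illustration.

```python
def cache_versions(num_keys, num_versions, writes_per_batch):
    """Build a list of versions; unchanged rows are shared between versions."""
    versions = [{k: object() for k in range(num_keys)}]
    for _ in range(1, num_versions):
        nxt = dict(versions[-1])           # shallow copy: row objects are shared
        for k in range(writes_per_batch):  # written keys get a fresh row object
            nxt[k] = object()
        versions.append(nxt)
    return versions


def unique_rows(versions):
    """Count distinct row objects held across the given versions."""
    return len({id(row) for version in versions for row in version.values()})


read_mostly = unique_rows(cache_versions(1000, 10, 0))       # 1000
moderate = unique_rows(cache_versions(1000, 10, 100))        # 1900
write_heavy = unique_rows(cache_versions(1000, 10, 1000))    # 10000
last_two = unique_rows(cache_versions(1000, 10, 1000)[-2:])  # 2000
```

With no writes, 10 cached versions hold the same 1000 rows as a single version; when every key is rewritten in every batch, they hold 10 times as many. Counting only the last two versions of the write-heavy case gives about 2x of one version. Whether the actual patch bounds the cache this way is an assumption of this sketch.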
>>>
>>> A related patch [3] was also submitted by another contributor (so I'm
>>> not the only one to notice this behavior), though that patch might not
>>> look general enough to apply.
>>>
>>> First, I decided to track the overall memory size of the state provider
>>> cache and expose it to the UI as well as the query status (SPARK-24441
>>> [4]). The metric looked critical and worth monitoring, so I thought it
>>> better to expose it (and the watermark) to Dropwizard (SPARK-24637 [5]).
>>>
>>> Building on SPARK-24441, I could find a more flexible way to resolve the
>>> issue (SPARK-24717), which I mentioned in the tl;dr.
>>>
>>> So 3 of 5 issues are coupled so far, to track and resolve one issue. I
>>> hope this helps explain why these patches are worth reviewing.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>>> 3. https://github.com/apache/spark/pull/21500
>>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>>
>>> ps. Before all the issues mentioned above, I also submitted some other
>>> issues regarding feature addition/refactoring (2 of 5 issues).
>>>
>>>
>>> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Bump. I have been having a hard time making additional PRs since some
>>>> of them rely on non-merged PRs, so I'm spending additional time
>>>> decoupling them where possible.
>>>>
>>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>>
>>>> 5 PRs are pending so far, and I may add more sooner or later.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim <[email protected]> wrote:
>>>>
>>>>> Kind reminder, since around 2 weeks have passed. I've added more PRs
>>>>> during these 2 weeks and am even planning to do more.
>>>>>
>>>>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I have a couple of pull requests for structured streaming which are
>>>>>> getting older and fading from the first pages of the PR list.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/21469
>>>>>> https://github.com/apache/spark/pull/21357
>>>>>> https://github.com/apache/spark/pull/21222
>>>>>>
>>>>>> Two of them have a kind of approval from a couple of folks, but no
>>>>>> approval from committers yet.
>>>>>> One of them needs a rebase, and I would be happy to do it after
>>>>>> review or while review is in progress.
>>>>>>
>>>>>> Getting reviewed in time is critical for contributors, to be honest,
>>>>>> so I'd like to ask the dev mailing list to review my PRs.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>
