[
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662007#comment-16662007
]
Jungtaek Lim edited comment on SPARK-10816 at 10/24/18 9:37 AM:
----------------------------------------------------------------
UPDATE: Since we don't access the session list in state by index (most access
patterns are iteration), we could build a linked list per key in state, using
"session start" as the pointer. This avoids shifting elements while keeping
the sessions ordered. It would bring less overhead and also avoid loading
sessions into memory.
Here's an implementation. To avoid full iteration, most operations take a
pointer as well, so they are better seen as raw operations.
https://github.com/HeartSaVioR/spark/blob/SPARK-10816-WIP-linked-list/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SessionWindowLinkedListState.scala
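To illustrate the idea (this is not the code behind the link above), here is a
minimal sketch of a per-key linked list whose pointers are session start
times. The names Session, Node, SessionLinkedList, addAfter and remove are
hypothetical, and plain in-memory maps stand in for the state store; the real
Spark StateStore API is not used here.

```scala
import scala.collection.mutable

// Hypothetical types for illustration only.
final case class Session(start: Long, end: Long)
final case class Node(session: Session, prev: Option[Long], next: Option[Long])

final class SessionLinkedList[K] {
  // (key, session start) -> node; stands in for a keyed state store.
  private val nodes = mutable.Map.empty[(K, Long), Node]
  // key -> session start of the first node in that key's list.
  private val heads = mutable.Map.empty[K, Long]

  def head(key: K): Option[Long] = heads.get(key)
  def get(key: K, start: Long): Option[Node] = nodes.get((key, start))

  // Raw operation: insert `s` right after the node at `pointer`
  // (None means "insert as the new head"). No elements are shifted.
  def addAfter(key: K, pointer: Option[Long], s: Session): Unit = {
    val nextPtr: Option[Long] = pointer match {
      case Some(p) =>
        val prevNode = nodes((key, p))
        nodes((key, p)) = prevNode.copy(next = Some(s.start))
        prevNode.next
      case None =>
        val oldHead = heads.get(key)
        heads(key) = s.start
        oldHead
    }
    nextPtr.foreach { n =>
      nodes((key, n)) = nodes((key, n)).copy(prev = Some(s.start))
    }
    nodes((key, s.start)) = Node(s, pointer, nextPtr)
  }

  // Raw operation: unlink the node at `start` and relink its neighbours.
  def remove(key: K, start: Long): Unit = {
    nodes.remove((key, start)).foreach { node =>
      node.prev match {
        case Some(p) => nodes((key, p)) = nodes((key, p)).copy(next = node.next)
        case None =>
          node.next match {
            case Some(n) => heads(key) = n
            case None    => heads.remove(key)
          }
      }
      node.next.foreach { n =>
        nodes((key, n)) = nodes((key, n)).copy(prev = node.prev)
      }
    }
  }

  // Iterate sessions in list order by following `next` pointers from the head.
  def iterator(key: K): Iterator[Session] = {
    val first: Option[Node] = head(key).flatMap(s => get(key, s))
    Iterator
      .iterate(first)(cur => cur.flatMap(_.next).flatMap(s => get(key, s)))
      .takeWhile(_.isDefined)
      .map(_.get.session)
  }
}
```

Because the caller passes the pointer, inserting or removing a session touches
only the neighbouring nodes. Keeping the list ordered by session start is the
caller's responsibility in this sketch.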
Based on the implementation, I'm now struggling to minimize the number of
state sessions to load. Currently we need to handle 100,100 sessions if there
are 100 sessions in a batch and 100,000 sessions in state, even though most of
the sessions in state will not match any session in the batch. Contrary to my
expectation, this is turning out to be complicated, so I'm weighing the
tradeoff between complexity and optimization.
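To make the mismatch concrete, the sketch below finds which state sessions
overlap at least one batch session. It assumes both lists are sorted by start
and that a session's end already includes the gap, so two sessions merge
exactly when their ranges overlap. The names Session and candidates are
hypothetical. Note that this scan still walks the state sessions, which is the
cost being discussed; it only shows how few of them actually need merging.

```scala
// Hypothetical type for illustration only.
final case class Session(start: Long, end: Long)

// Two-pointer scan over two lists sorted by start.
// Returns the state sessions that overlap at least one batch session.
def candidates(state: IndexedSeq[Session], batch: IndexedSeq[Session]): Vector[Session] = {
  val out = Vector.newBuilder[Session]
  var i = 0
  var j = 0
  while (i < state.length && j < batch.length) {
    if (state(i).end < batch(j).start) i += 1       // state session lies before the batch session
    else if (batch(j).end < state(i).start) j += 1  // batch session lies before the state session
    else { out += state(i); i += 1 }                // ranges overlap: this state session is needed
  }
  out.result()
}

// Example: 1000 sparse state sessions, 3 batch sessions.
val state = (0 until 1000).map(i => Session(i * 100L, i * 100L + 10)).toVector
val batch = Vector(Session(205L, 220L), Session(5000L, 5005L), Session(99950L, 99960L))
val needed = candidates(state, batch)
```

In this example only two of the 1000 state sessions overlap a batch session;
the rest are read without ever being merged.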
> EventTime based sessionization
> ------------------------------
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Reynold Xin
> Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session
> Window Support For Structure Streaming.pdf
>
>