Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/4167#issuecomment-72144338
Well, it is generally considered outside the scope of Spark Streaming to
handle windows that large! If you have to process data that is a day old, then
you probably need a dedicated storage system. Spark Streaming is not a storage
system, so using it for long-term data storage is using it outside its design
space. Breaking default-case performance for those out-of-the-design-space
scenarios is not the right solution. Those cases should be handled by changing
the storage level directly. And users who need that sort of performance across
such large windows obviously need to learn a bit more about Spark Streaming. We
can probably help them learn; maybe add some material to the programming guide?
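
Concretely, changing the storage level directly might look like the sketch
below, using `DStream.persist`. The socket source, host/port, and window sizes
are placeholders, not part of this discussion:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object LargeWindowExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LargeWindowExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder input source.
    val lines = ssc.socketTextStream("localhost", 9999)

    // A large window may not fit in memory; rather than changing the
    // default storage level for everyone, the user opts in to spilling
    // to disk explicitly on the windowed stream.
    val windowed = lines.window(Minutes(60), Minutes(5))
    windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)

    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```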
An alternate, more sophisticated solution is to detect when such spillover
is happening continuously and print a suggestion (as a log4j warning) saying
"not enough memory to store the whole window, consider using MEMORY_AND_DISK
for the windowed stream". This is definitely trickier to do, but it is a safer
solution that does not involve a regression.
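
A hypothetical sketch of that detection idea, assuming the developer API
`SparkContext.getRDDStorageInfo`; `checkWindowCaching` and the exact warning
text are made up for illustration:

```scala
import org.apache.log4j.Logger
import org.apache.spark.SparkContext

object WindowSpilloverMonitor {
  private val log = Logger.getLogger(getClass)

  // Hypothetical check: scan cached RDDs and warn when some partitions
  // could not be kept in memory. Partitions that were evicted (or never
  // cached) show up as the gap between numPartitions and
  // numCachedPartitions in the storage info.
  def checkWindowCaching(sc: SparkContext): Unit = {
    for (info <- sc.getRDDStorageInfo) {
      if (info.numCachedPartitions < info.numPartitions) {
        log.warn(
          s"RDD ${info.id} (${info.name}) has only " +
          s"${info.numCachedPartitions} of ${info.numPartitions} partitions " +
          "cached: not enough memory to store the whole window, consider " +
          "using MEMORY_AND_DISK for the windowed stream.")
      }
    }
  }
}
```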