Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/4167#issuecomment-72144338
  
    Well, it is generally considered outside the scope of Spark Streaming to 
handle windows that large! If you have to process data that is a day old, then 
you probably need a dedicated storage system. Spark Streaming is not a storage 
system, so using it for long-term data storage is using it outside its design 
space. Breaking default-case performance for those out-of-the-design-space 
scenarios is not the right solution. Those cases should be handled by changing 
the storage level directly. And users who need that sort of performance across 
such large windows obviously need to learn a bit more about Spark Streaming. We 
can probably help them learn. Maybe add some material to the programming guide?
    
    An alternate, more sophisticated solution is to detect when such spillover 
is happening continuously and print a suggestion (a log4j warning) saying "not 
enough memory to store the whole window, consider using MEMORY_AND_DISK for the 
windowed stream". This is definitely trickier to do, but it is a safer solution 
that does not involve a regression.
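    For reference, a minimal sketch of what "changing the storage level 
directly" looks like on a windowed stream (the socket source, host/port, and 
durations here are hypothetical, just for illustration):

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("LargeWindowDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical source; any DStream works the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // A very large window (1 hour, sliding every minute). Overriding the
    // default storage level lets window data spill to disk instead of
    // being dropped from memory.
    val windowed = lines.window(Seconds(3600), Seconds(60))
    windowed.persist(StorageLevel.MEMORY_AND_DISK)

    windowed.count().print()
    ```

    The user opts into the memory-plus-disk behavior explicitly, so the 
default in-memory fast path is unaffected.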

