Github user aalobaidi commented on the issue:

    https://github.com/apache/spark/pull/21500
  
    @HeartSaVioR 
    
    1. As I mentioned before, this option is beneficial for use cases with 
bigger micro-batches. This way the overhead of loading the state from disk is 
spread across a large number of events in each task, so the overall throughput 
is not hurt as badly as it would be when loading the state for only a small 
number of events. It is a typical space-time tradeoff: using less memory 
results in an increase in execution time.
    
    2. Currently, as far as I know, there is no way for Spark to handle state 
that is bigger than the total cluster memory. With this option, users can set 
the number of partitions (using `spark.sql.shuffle.partitions`) to, for 
example, 10 times the total number of cores in the cluster. As a result, the 
cluster will only hold roughly 10% of the state in memory at any given time 
(see the sketch after this list).
    
    3. By default, the option is disabled and does not change the current 
behavior. Also, enabling or disabling it does not require rebuilding the 
state, so users can easily test it and decide whether the performance impact 
is acceptable.
    
    4. In future implementations, we could have an executor-wide state manager 
that evicts state partitions from memory only when needed.
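
    To make the sizing in point 2 concrete, here is a minimal Scala sketch, 
assuming a cluster with 40 total executor cores. `spark.sql.shuffle.partitions` 
is the real Spark setting; the commented-out state-store flag name is purely 
hypothetical and only stands in for the option discussed in this PR.

    ```scala
    import org.apache.spark.sql.SparkSession

    object StatePartitionSizingSketch {
      def main(args: Array[String]): Unit = {
        // Example value; in practice use the real total number of executor cores.
        val totalCores = 40
        // ~10x the core count means each scheduling wave touches roughly 10% of the state.
        val shufflePartitions = totalCores * 10

        val spark = SparkSession.builder()
          .appName("state-partition-sizing-sketch")
          .master("local[*]")
          .config("spark.sql.shuffle.partitions", shufflePartitions.toString)
          // Hypothetical name standing in for the unload-after-commit option in this PR:
          // .config("spark.sql.streaming.stateStore.unloadStateAfterCommit", "true")
          .getOrCreate()

        println("spark.sql.shuffle.partitions = " +
          spark.conf.get("spark.sql.shuffle.partitions"))
        spark.stop()
      }
    }
    ```

    With 400 partitions and 40 concurrently running tasks, only about one tenth 
of the state partitions would need to be resident in memory at any moment.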
    
    Also, I agree that a better solution should be developed, maybe using 
RocksDB. But for the time being, and for this store implementation, this 
option enables one extra use case with very little code to maintain.


---
