Github user aalobaidi commented on the issue:
https://github.com/apache/spark/pull/21500
@HeartSaVioR
1. As I mentioned before, this option is beneficial for use cases with
bigger micro-batches. This way the overhead of loading the state from disk is
spread across a large number of events in each task, so overall throughput is
not hurt as badly as it would be when loading the state for only a small number
of events. It is a typical space-time tradeoff: using less memory results in
longer execution time.
2. Currently, as far as I know, there is no way for Spark to handle state
that is bigger than the total cluster memory. With this option, users can set
the number of partitions (using `spark.sql.shuffle.partitions`) to, for
example, 10 times the total number of cores in the cluster. As a result, the
cluster will load only ~10% of the state at any given time (see the sketch
after this list).
3. By default, the option is disabled and does not change the current
behavior. Also, enabling or disabling it doesn't require rebuilding the state,
so users will be able to test it easily and decide whether the performance
impact is acceptable.
4. In future implementations, we can have an executor-wide state manager
that evicts state partitions from memory only when needed.
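To illustrate point 2, here is a minimal sketch of the sizing idea (not code from this PR): a stateful streaming query where `spark.sql.shuffle.partitions` is set to 10x the core count, so only roughly a tenth of the state partitions are resident at any moment. The 20-core figure and the socket word-count source are illustrative assumptions, and the option added by this PR would still need to be enabled through its own config, which I'm not showing here.

```scala
import org.apache.spark.sql.SparkSession

object StatePartitioningSketch {
  def main(args: Array[String]): Unit = {
    // Assumed cluster size for the example: 20 total cores. With 10x that
    // many shuffle partitions, each micro-batch runs 200 state tasks, so
    // only ~20 state partitions (~10% of the state) are loaded at once.
    val totalCores = 20

    val spark = SparkSession.builder()
      .appName("state-partitioning-sketch")
      // Number of state partitions for the stateful aggregation. Note it is
      // fixed at the first run of a query and recorded in the checkpoint.
      .config("spark.sql.shuffle.partitions", (totalCores * 10).toString)
      .getOrCreate()

    import spark.implicits._

    // Minimal stateful query: a streaming word count whose aggregation
    // state is spread across the 200 partitions configured above.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The tradeoff is more, smaller tasks per micro-batch: only a fraction of the total state has to be in memory at once, at the cost of extra per-task load overhead, which matches the space-time tradeoff described in point 1.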
Also, I agree that a better solution should be developed, maybe using
RocksDB. But for the time being, and for this store implementation, this change
enables one extra use case with very little code to maintain.