Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21733

@arunmahadevan I'm actually in favor of changing the default behavior, I'm just not 100% sure the result would be promising for exhaustive use cases. I might need to prepare more kinds of key/value pairs (key size bigger than value size, key size smaller than value size, key size equal to value size; anything else I'm missing?), run some tests, and back it up with new numbers.

Btw, as you commented, there seem to be two approaches to identify the old and new format:

> looking at the fields in the row

I actually tried this before (by checking the number of fields in the value row, since this patch reduces that count), but soon realized it can't work: HDFSBackedStateStoreProvider relies on the provided keySchema and valueSchema when serializing/deserializing rows rather than leveraging UnsafeRow's own serialization/deserialization mechanism (writeExternal/readExternal, or write/read via Kryo). So if the schema doesn't match the actual rows, we just get undefined behavior, and there is no way to verify it (see the rough sketch at the end of this comment). The current approach saves the cost of writing/reading two additional integers per row at the price of not being able to verify the rows. If we wanted to add that ability, a state migration would be required.

> introducing a row version to differentiate old vs new

We could do this by applying the same approach as #21739, so it is valid, but a query with the old state format would either need a state migration (not easy, since it would have to be done against multiple versions of state) or keep relying on the old format.

@jose-torres Could you please take a look at @arunmahadevan's comment as well as this one and share your thoughts? Thanks in advance!
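
To make the serialization point concrete, here is a minimal sketch of the size-prefixed raw-bytes layout described above. The method names (writeRecord/readRecord) and exact signatures are illustrative, not the actual ones in HDFSBackedStateStoreProvider; the point is that nothing in the stored bytes records how many fields the value row has, so the reader can only trust the schema the caller passes in.

```scala
import java.io.{DataInputStream, DataOutputStream}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Illustrative write path: only the raw UnsafeRow bytes and their byte sizes
// are persisted; no field counts, no format/version marker.
def writeRecord(out: DataOutputStream, key: UnsafeRow, value: UnsafeRow): Unit = {
  val keyBytes = key.getBytes
  val valueBytes = value.getBytes
  out.writeInt(keyBytes.length)
  out.write(keyBytes)
  out.writeInt(valueBytes.length)
  out.write(valueBytes)
}

// Illustrative read path: the field counts come from the caller-supplied schemas.
// If they don't match what was written, the row is silently misinterpreted
// instead of failing, which is why the old/new format can't be detected here.
def readRecord(in: DataInputStream,
               numKeyFields: Int,
               numValueFields: Int): (UnsafeRow, UnsafeRow) = {
  val keySize = in.readInt()
  val keyBytes = new Array[Byte](keySize)
  in.readFully(keyBytes)
  val key = new UnsafeRow(numKeyFields)
  key.pointTo(keyBytes, keySize)

  val valueSize = in.readInt()
  val valueBytes = new Array[Byte](valueSize)
  in.readFully(valueBytes)
  val value = new UnsafeRow(numValueFields)
  value.pointTo(valueBytes, valueSize)

  (key, value)
}
```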