Github user HeartSaVioR commented on the issue:

    https://github.com/apache/spark/pull/21733
  
    @arunmahadevan 
    I'm actually in favor of changing the default behavior; I'm just not 100% sure 
the result would be promising across all use cases. I might need to prepare more 
kinds of key/value pairs (key size bigger than value size, key size smaller than 
value size, key size equal to value size; anything else I'm missing?), run some 
tests, and back it up with new numbers.
    
    Btw, as you commented, there seem to be two approaches to distinguish the old 
and new formats:
    
    > looking at the fields in the row
    
    Actually I tried to do this before (by checking the number of fields in the 
value row, since this patch reduces that count), and soon realized I can't, 
because HDFSBackedStateStoreProvider relies on the provided keySchema and 
valueSchema when serializing / deserializing rows rather than leveraging 
UnsafeRow's own serialization/deserialization mechanism (writeExternal/readExternal 
or write/read via Kryo). If the schema doesn't match the actual rows, we just get 
undefined behavior, and there is no way to verify this.
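    
    For context, here is a minimal sketch of why such a check can't be relied on. 
This is an illustrative simplification with hypothetical names, not the provider's 
actual code: the value bytes are overlaid with an UnsafeRow whose field count comes 
from the provided valueSchema, so a wrong schema silently misreads the bytes 
instead of failing.
    
```scala
// Illustrative sketch only (hypothetical names): why a field-count check can't
// detect the format. UnsafeRow.pointTo simply overlays the expected layout onto
// raw bytes; nothing about the bytes themselves is validated.
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

object ValueDeserializationSketch {
  def deserializeValue(valueBytes: Array[Byte], numValueFields: Int): UnsafeRow = {
    val row = new UnsafeRow(numValueFields)    // field count comes from the provided schema
    row.pointTo(valueBytes, valueBytes.length) // raw bytes are reinterpreted, never checked
    row                                        // a mismatched schema yields garbage, not an error
  }
}
```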
    
    The current approach saves the cost of writing/reading two additional 
integers, at the expense of being able to verify the rows. If we want that 
capability, a state migration would be required.
    
    > introducing a row version to differentiate old vs new
    
    We could do this by applying the same approach as #21739, so this is valid, 
but a query with the old state format would either have to go through a state 
migration (not easy, since it would have to be done against multiple versions of 
state) or continue relying on the old state format.
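    
    To make the row-version idea concrete, here is a rough sketch of one possible 
encoding: a version byte prefixed to the serialized value. This is hypothetical 
(the names and layout are mine, and not necessarily what #21739 or this patch 
does); it also shows why state written before such a change would still need a 
migration or a legacy read path, since it carries no prefix at all.
    
```scala
// Hypothetical sketch of a version-prefixed state value; names and layout are
// illustrative and not taken from #21739 or this patch.
import java.nio.ByteBuffer

object VersionedStateSketch {
  val StateFormatVersion: Byte = 2

  def writeVersionedValue(valueBytes: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + valueBytes.length)
    buf.put(StateFormatVersion) // version prefix marks the new layout
    buf.put(valueBytes)
    buf.array()
  }

  def readVersionedValue(bytes: Array[Byte]): (Byte, Array[Byte]) = {
    // Rows written by older versions carry no prefix, so their first byte would be
    // misread as a version here; that is why existing state either needs a migration
    // or a separate legacy read path.
    (bytes(0), bytes.drop(1))
  }
}
```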
    
    @jose-torres Could you please take a look at @arunmahadevan 's comment as 
well as this one and share your thoughts? Thanks in advance!

