HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-635603504
> One lesson I learned from the past is UnsafeRow is not designed to be persisted across Spark versions.

This sounds like a blocker for SS, as we leverage UnsafeRow to store state, which must remain usable across Spark versions and must not be corrupted or lost: replaying from scratch to reconstruct the same state is time-consuming, and sometimes simply impossible due to retention policy. Do you have references to the discussion/gotchas around that issue? I think we should really fix this for storing state - probably state must not be stored in the UnsafeRow format, then.

> Do you know what are the performance numbers if we just compress the text files?

I haven't experimented, but we can easily imagine the file size would be smaller, whereas processing time may be affected both positively (less to read from remote storage) and negatively (we still have to serialize/deserialize with JSON, plus pay the cost of compression).

The point of this PR is that we know the schema (and even the versioning) of the file in advance, hence we don't (and shouldn't) pay a huge cost to make the file backward-compatible by itself. We don't do versioning for the data structures used by the event log, so we are paying a huge cost to make them backward/forward compatible. If we are not sure about the UnsafeRow format for storage, we could just try the traditional approaches instead.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
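
To make the trade-off above concrete, here is a minimal sketch (in Python, purely for illustration; this is not Spark's actual file format, and the `batchId`/`timestampMs` fields and the version byte layout are invented for the example) contrasting a self-describing JSON record, which repeats field names in every record so it stays readable without external knowledge, with a fixed-schema binary record, where the reader already knows the layout and only a small version marker is stored to allow schema evolution:

```python
import json
import struct

# Hypothetical metadata-log entry for illustration only; the field names
# and layout are assumptions, not Spark's real on-disk format.
entry = {"batchId": 42, "timestampMs": 1_590_000_000_000}

# Self-describing approach: JSON embeds the field names in every record,
# so the file can be read without knowing the schema in advance.
json_bytes = json.dumps(entry).encode("utf-8")

# Fixed-schema approach: the reader knows the schema in advance, so we
# store only a version byte plus the raw values (two big-endian int64s).
FORMAT_VERSION = 1
binary_bytes = struct.pack(
    ">Bqq", FORMAT_VERSION, entry["batchId"], entry["timestampMs"]
)

def decode(data: bytes) -> dict:
    # The version byte tells the reader which fixed layout to apply;
    # this is how compatibility is handled without making every record
    # self-describing.
    version, batch_id, ts = struct.unpack(">Bqq", data)
    assert version == FORMAT_VERSION
    return {"batchId": batch_id, "timestampMs": ts}

print(len(json_bytes), len(binary_bytes))
assert decode(binary_bytes) == entry
```

The fixed-schema record here is 17 bytes versus 45 for the JSON form, which is the kind of saving the PR is after; the cost is that correctness now depends on the reader and writer agreeing on the versioned layout out of band.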