anishshri-db commented on PR #43425: URL: https://github.com/apache/spark/pull/43425#issuecomment-1769113929
@HeartSaVioR - some high-level questions:

- It looks like we only expose the partition id as a metadata/internal column, and only when it is explicitly queried. We don't seem to expose other columns such as batchId/operatorId. What is the reason for doing this?
- For some of the queries such as join/FMGWS, it seems we have different formats for v1/v2 and the user needs to query them differently within the selectExpr. How does the user discover these fields? Is it possible to keep the source schema homogeneous here? (A rough usage sketch is below.)
- For join queries, what schema do we expose when a store name is explicitly specified vs. when it is not? I guess the ability to query a specific store name (especially ones like right-keyToNumValues) is only really for debugging purposes in this case? Also, for join queries, where do we add the internal metadata columns like partitionId? I'm not sure I found that.
- For the tests, I didn't see a simulation of the expected use cases, e.g. tests where we keep the streaming query running for a few batches and assert certain conditions/state values along the way. Maybe also something around corruption detection, where we artificially corrupt some values and show how the state reader can detect them?
- Should we also add some test cases with additional StartStream/StopStream clauses and verify that state reads work as expected even when batch recovery/restart cases are involved? (A rough test shape is sketched at the end.)
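To make the second and third points concrete, here is roughly the kind of usage I have in mind. The source short name (`statestore`), the option names (`path`, `joinSide`, `storeName`), the `_partition_id` column name, and the v1/v2 field layout are my assumptions from reading this PR and may not match what finally lands:

```scala
// Sketch only; assumes an active SparkSession `spark` (e.g. spark-shell)
// and a hypothetical checkpoint location. Source/option/column names are
// assumptions based on this PR, not the final API.
val stateDf = spark.read
  .format("statestore")
  .option("path", "/tmp/checkpoint")
  .load()

// Partition id surfaces only when explicitly selected (metadata column).
stateDf.selectExpr("key", "value", "_partition_id").show()

// flatMapGroupsWithState: the layout of `value` differs between state format
// v1 and v2, so the selectExpr the user writes depends on which version the
// query ran with (field names below are illustrative).
stateDf.selectExpr("key", "value.*").show()             // v1: user state fields inline
stateDf.selectExpr("key", "value.groupState.*").show()  // v2: user state nested (assumed)

// Stream-stream join: either pick a side, or name a specific internal store
// (the latter mainly for debugging).
val leftSide = spark.read.format("statestore")
  .option("path", "/tmp/checkpoint")
  .option("joinSide", "left")
  .load()
val rightKeyToNumValues = spark.read.format("statestore")
  .option("path", "/tmp/checkpoint")
  .option("storeName", "right-keyToNumValues")
  .load()
```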
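And for the last two points, a rough shape of the test I'm suggesting, using the existing StreamTest DSL (again, the state source name and options are the same assumptions as above, and the assertion is only illustrative since the exact key/value field names depend on the operator's state schema):

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.{OutputMode, StreamTest}

class StateReaderRestartSuite extends StreamTest {
  import testImplicits._

  test("state read works across query restart") {
    withTempDir { checkpointDir =>
      val input = MemoryStream[Int]
      val agg = input.toDF().groupBy("value").count()

      testStream(agg, OutputMode.Complete())(
        StartStream(checkpointLocation = checkpointDir.getAbsolutePath),
        AddData(input, 1, 2, 2),
        ProcessAllAvailable(),
        StopStream,
        // Restart to exercise the batch-recovery path before reading state.
        StartStream(checkpointLocation = checkpointDir.getAbsolutePath),
        AddData(input, 2, 3),
        ProcessAllAvailable(),
        StopStream,
        Execute { _ =>
          val state = spark.read
            .format("statestore")                          // assumed short name
            .option("path", checkpointDir.getAbsolutePath)
            .load()
          // One state row per grouping key (1, 2, 3) is expected here; real
          // tests would also assert on the state values themselves.
          assert(state.count() == 3)
        }
      )
    }
  }
}
```

Thanks!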
