anishshri-db commented on PR #43425:
URL: https://github.com/apache/spark/pull/43425#issuecomment-1769113929

   @HeartSaVioR - some high-level questions:
   
   - It seems that we only expose the partition ID as a metadata/internal 
column, and only when it is queried. We don't seem to expose other columns such 
as batchId/operatorId. What is the reason for this?
   - For some queries, such as join/FMGWS (flatMapGroupsWithState), it seems 
that we have different formats for v1/v2, and the user needs to query them 
differently within selectExpr. How does the user discover these fields? Is it 
possible to keep the source schema homogeneous here?
   - For join queries, what schema do we expose when a store name is explicitly 
specified versus when it is not? I guess the ability to query a specific store 
name (especially ones like right-keyToNumValues) is really only for debugging 
purposes in this case? Also, for join queries, where do we add the internal 
metadata columns like partitionId? I'm not sure I found that.
   - For the tests, I'm not sure I saw a simulation of the expected use cases, 
e.g. tests where we keep the streaming query running for a few batches and 
assert certain conditions/state values along the way. Also, maybe something 
around corruption detection, where we artificially corrupt some values and show 
how the state reader can detect them?
   - For the tests, should we also add some cases with additional 
startStream/stopStream clauses and verify that state reads work as 
expected even when batch recovery/restart is involved?
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

