[GitHub] [iceberg] SreeramGarlapati edited a comment on pull request #4473: Spark3.2: implement micro-batch streaming read - commitOffset

GitBox Sun, 03 Apr 2022 08:57:15 -0700


SreeramGarlapati edited a comment on pull request #4473:
URL: https://github.com/apache/iceberg/pull/4473#issuecomment-1086892995



   @RussellSpitzer - thanks a lot for your review.
   
   The challenge with unit testing this - is that - sparkSession is holding on 
to the state of the stream - internally - which makes it non-reproducible with 
1 sparkSession. We saw this issue when our streams in EMR cluster had to be 
bought down and restarted on another EMR cluster & were able to verify this 
fix. So, all-in-all - I might be needing a unit test which spans across 2 
sparkSessions & try this; the way tests are written is needing a bit of 
refactor to make it use 2 spark sessions & even after that I am unsure - if 
this will reproduce this. Is there any precedence in the codebase for this 
pattern that you could kindly point me to.
   
   I have put the fix before even I have that unit test before the public - 
**for visibility** - so ppl. can port it as needed. This bug is essentially a 
**time bomb** in the code (which blasts off the moment a stream moves across 
clusters) whoever is using readStream from iceberg table & have first snapshot 
expired; which is almost everyone who uses this! 
   
   Let me try that 2 spark sessions creation unittest and get back.
   
   putting the problem being explained above aside - the code itself has 
coverage in this unittest - `testResumingStreamReadFromCheckpoint` - using 
which I was able to debug thru & fix the code - where I need to use 
`file.createOrOverwrite()` instead of the existing `file.create()`.
   
   cc: @rajarshisarkar  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] SreeramGarlapati edited a comment on pull request #4473: Spark3.2: implement micro-batch streaming read - commitOffset

Reply via email to