SreeramGarlapati edited a comment on pull request #4473: URL: https://github.com/apache/iceberg/pull/4473#issuecomment-1086892995
@RussellSpitzer - thanks a lot for your review. The challenge with unit testing this - is that - sparkSession is holding on to the state of the stream - internally - which makes it non-reproducible with 1 sparkSession. We saw this issue when our streams in EMR cluster had to be bought down and restarted on another EMR cluster & were able to verify this fix. So, all-in-all - I might be needing a unit test which spans across 2 sparkSessions & try this; the way tests are written is needing a bit of refactor to make it use 2 spark sessions & even after that I am unsure - if this will reproduce this. Is there any precedence in the codebase for this pattern that you could kindly point me to. I have put the fix before even I have that unit test before the public - **for visibility** - so ppl. can port it as needed. This bug is essentially a **time bomb** in the code (which blasts off the moment a stream moves across clusters) whoever is using readStream from iceberg table & have first snapshot expired; which is almost everyone who uses this! Let me try that 2 spark sessions creation unittest and get back. putting the problem being explained above aside - the code itself has coverage in this unittest - `testResumingStreamReadFromCheckpoint` - using which I was able to debug thru & fix the code - where I need to use `file.createOrOverwrite()` instead of the existing `file.create()`. cc: @rajarshisarkar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
