SreeramGarlapati edited a comment on pull request #4473:
URL: https://github.com/apache/iceberg/pull/4473#issuecomment-1086892995


   @RussellSpitzer - thanks a lot for your review.
   
   The challenge with unit testing this - is that - sparkSession is holding on 
to the state of the stream - internally. We saw this issue when our streams in 
EMR cluster had to bought down and restarted on another EMR cluster & were able 
to verify this fix. So, all-in-all - I might be needing a unit test which spans 
across 2 sparkSessions & try this; the way tests are written is needing a bit 
of refactor to make it use 2 spark sessions & even after that I am unsure - if 
this will reproduce this. Is there any precedence in the codebase for this 
pattern that you could kindly point me to.
   
   I have put the fix before even I have that unit test before the public - 
**for visibility** - so ppl. can port it as needed. This bug is essentially a 
**time bomb** in the code (which blasts off the moment a stream moves across 
clusters) whoever is using readStream from iceberg table & have first snapshot 
expired; which is almost everyone who uses this! 
   
   Let me try that 2 spark sessions creation unittest and get back.
   
   putting the problem being explained above aside - the code itself has 
coverage in this unittest - `testResumingStreamReadFromCheckpoint` - using 
which I was able to debug thru & fix the code - where I need to use 
`file.createOrOverwrite()` instead of the existing `file.create()`.
   
   cc: @rajarshisarkar  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to