Github user koeninger commented on the issue:

    https://github.com/apache/spark/pull/21917
  
    > How do you know that offset 4 isn't just lost because poll failed?
    
    By failed, you mean returned an empty collection after timing out, even 
though records should be available?  You don't.  You also don't know that it 
isn't just lost because Kafka skipped a message.  AFAIK, given the information 
a Kafka consumer exposes, once you start allowing gaps in offsets, you can't 
tell the difference.
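    
    To illustrate, here's a rough sketch of why the consumer API alone can't 
distinguish the two cases. This assumes a local broker at localhost:9092 and a 
hypothetical demo-topic; it's not code from this PR:
    
    ```scala
    import java.time.Duration
    import java.util.{Collections, Properties}

    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    object GapOrTimeout {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")

        val tp = new TopicPartition("demo-topic", 0) // hypothetical topic
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
        try {
          consumer.assign(Collections.singletonList(tp))
          consumer.seek(tp, 4L) // request offset 4

          val records = consumer.poll(Duration.ofMillis(1000)).records(tp)
          if (records.isEmpty) {
            // Ambiguous: the poll may have timed out before fetching, or
            // offset 4 may simply not exist (compaction, transaction
            // markers). Nothing the consumer returns distinguishes the two.
            println("empty poll: timeout or gap, can't tell which")
          } else if (records.get(0).offset() > 4L) {
            // Records came back but start past the requested offset: here,
            // and only here, you know there's a real gap.
            println(s"gap: first returned offset ${records.get(0).offset()}")
          }
        } finally {
          consumer.close()
        }
      }
    }
    ```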
    
    I understand your point, but even under your proposal you have no guarantee 
that a poll that succeeds in the first pass during RDD construction won't then 
fail on the executor during computation, right?
    
    > The issue with your proposal is that SeekToEnd gives you the last offset 
which might not be the last record.
    
    Have you tested comparing the results of consumer.endOffsets for consumers 
with different isolation levels?
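    
    If not, here's roughly how one could check, again assuming a local broker 
and a hypothetical demo-topic. For a read_committed consumer, endOffsets 
reflects the last stable offset, which can trail the log end offset seen by a 
read_uncommitted consumer while transactions are open:
    
    ```scala
    import java.util.{Collections, Properties}

    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    object EndOffsetsByIsolation {
      private def endOffsetFor(isolationLevel: String, tp: TopicPartition): Long = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("isolation.level", isolationLevel)

        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
        try {
          consumer.endOffsets(Collections.singletonList(tp)).get(tp)
        } finally {
          consumer.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val tp = new TopicPartition("demo-topic", 0) // hypothetical topic
        // read_uncommitted: end offset is the log end offset.
        // read_committed: end offset is the last stable offset (LSO).
        println(s"read_uncommitted endOffset = ${endOffsetFor("read_uncommitted", tp)}")
        println(s"read_committed   endOffset = ${endOffsetFor("read_committed", tp)}")
      }
    }
    ```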
    
    Your proposal might end up being the best approach anyway, just because of 
the unfortunate effect on StreamInputInfo and count, but I want to make sure we 
think this through.

