Github user dibbhatt commented on the pull request:

    https://github.com/apache/spark/pull/6614#issuecomment-109185001
  
    Just to understand when the message count may go wrong, here is my
    understanding. Please let me know if I am wrong.
    As I see the logic, if the WAL is enabled we are fine, since
    WriteAheadLogBasedBlockHandler serializes the block. If the WAL is not
    enabled, BlockManagerBasedBlockHandler uses BlockManager's putIterator
    call for both the Iterator and ArrayBuffer cases. If the storage level
    is ExternalStore or DiskStore, both serialize the iterator and the
    count comes out fine. MemoryStore also unrolls the iterator via
    unrollSafely if enough memory is available. The issue may arise if
    MemoryStore fails to unroll safely (or does only a partial unroll) due
    to a lack of free memory; the reported count will then be less than the
    actual number of records in the block. This issue will happen
    irrespective of the replication factor.
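    
    To make that failure mode concrete, here is a minimal, self-contained
    sketch (not Spark's actual MemoryStore code; `unrollSafely`, `budget`,
    and `sizeOf` here are just illustrative stand-ins) of how counting only
    the unrolled portion of an iterator undercounts when the memory budget
    runs out part-way:
    
    ```scala
    import scala.collection.mutable.ArrayBuffer
    
    object UnrollSketch {
      // Unroll an iterator into memory until a byte budget is exhausted.
      // Left(array) => the whole block fit, so its count is exact.
      // Right(iter) => partial unroll; only part of the block is in memory.
      def unrollSafely[T](values: Iterator[T], budget: Long,
                          sizeOf: T => Long): Either[Array[T], Iterator[T]] = {
        val unrolled = ArrayBuffer.empty[T]
        var used = 0L
        while (values.hasNext && used <= budget) {
          val v = values.next()
          unrolled += v
          used += sizeOf(v)
        }
        if (!values.hasNext) Left(unrolled.toArray)
        else Right(unrolled.iterator ++ values) // leftover not yet counted
      }
    
      def main(args: Array[String]): Unit = {
        val records = (1 to 1000).iterator
        unrollSafely(records, budget = 100L, sizeOf = (_: Int) => 1L) match {
          case Left(arr) =>
            println(s"fully unrolled, exact count = ${arr.length}")
          case Right(_) =>
            // Counting only what was unrolled in memory would report < 1000,
            // i.e. fewer records than were actually in the block.
            println("partial unroll: an in-memory count misses the tail")
        }
      }
    }
    ```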
    
    Will this be a very common behavior? In my environment, for every
    storage level I get the correct count of the number of Kafka messages I
    pumped. I would probably need to hit the memory limit of MemoryStore's
    unrollSafely for it to run out of memory, which seems to be a rare
    scenario.
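    
    If it helps to force that rare case, here is a rough sketch (hedged:
    the ZooKeeper address, group id, and topic are placeholders, and the
    memoryFraction/unrollFraction knobs are the 1.x storage settings) that
    shrinks the unroll budget so a large MEMORY_ONLY receiver block is more
    likely to be only partially unrolled:
    
    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    
    object UnrollLimitRepro {
      def main(args: Array[String]): Unit = {
        // Deliberately small storage/unroll budgets so unrollSafely is more
        // likely to run out of memory on a big receiver block.
        val conf = new SparkConf()
          .setAppName("unroll-undercount-repro")
          .set("spark.storage.memoryFraction", "0.1")
          .set("spark.storage.unrollFraction", "0.05")
        val ssc = new StreamingContext(conf, Seconds(2))
    
        // MEMORY_ONLY without the WAL goes through
        // BlockManagerBasedBlockHandler, the path discussed above.
        val stream = KafkaUtils.createStream(
          ssc, "zk:2181", "repro-group", Map("test-topic" -> 1),
          StorageLevel.MEMORY_ONLY)
    
        // Compare this count against the number of messages pumped in.
        stream.count().print()
    
        ssc.start()
        ssc.awaitTermination()
      }
    }
    ```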
    
    Let me know what you think. 


