Github user dibbhatt commented on the pull request:
https://github.com/apache/spark/pull/6614#issuecomment-109185001
Just to understand when the message count may go wrong, here is my
understanding. Please let me know if I am wrong.
As I see the logic, if WAL is enabled we are fine, because
WriteAheadLogBasedBlockHandler serializes the block. If WAL is not enabled,
BlockManagerBasedBlockHandler uses BlockManager's putIterator call for both
Iterator and ArrayBuffer blocks. If the storage level is ExternalStore or
DiskStore, both serialize the iterator and the count comes out fine. For
MemoryStore, it also does an unrollSafely of the iterator if enough memory is
available. The issue may come if MemoryStore fails to unroll safely (or does a
partial unroll) due to lack of free memory; then the reported count will be
less than the actual number of records in the block. This issue will happen
irrespective of the replication factor.
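
To make the failure mode concrete, here is a minimal sketch (not Spark's
actual MemoryStore code; a record budget stands in for the byte-level
free-memory check that unrollSafely performs) of how counting only the
unrolled portion of a partially unrolled iterator undercounts the block:

```scala
// Toy stand-in for unrollSafely: materialize records until the block is
// exhausted or the (fake) memory budget runs out; leftovers stay in the
// unconsumed iterator.
def unrollSafely[T](values: Iterator[T], budget: Int): (Vector[T], Iterator[T]) = {
  val unrolled = Vector.newBuilder[T]
  var n = 0
  while (values.hasNext && n < budget) { unrolled += values.next(); n += 1 }
  (unrolled.result(), values)
}

val (unrolled, leftover) = unrollSafely((1 to 1000).iterator, budget = 600)
println(s"counted records:   ${unrolled.size}")  // 600 -- what gets reported
println(s"uncounted records: ${leftover.size}")  // 400 -- never enter the count
```

When the unroll completes fully, `leftover` is empty and the count is exact;
only the partial-unroll path undercounts.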
Will this be a very common behavior? In my environment, for every storage
level I get the correct count of the Kafka messages I pumped in. I would
probably need to hit the memory limit of MemoryStore's unrollSafely to run out
of memory, which seems to be a rare scenario.
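
For anyone wanting to try to reproduce it, a hedged sketch: on Spark 1.x one
could shrink the unroll budget and receive with a memory-only storage level,
so unrollSafely is more likely to run out of space. The host/port and the
exact fraction values below are placeholders, and whether these settings
actually trigger the undercount is my assumption:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumption: shrinking the Spark 1.x unroll budget makes a partial unroll
// (and hence a low reported count) more likely; values are placeholders.
val conf = new SparkConf()
  .setAppName("partial-unroll-repro")
  .setMaster("local[2]")                       // streaming needs >= 2 threads locally
  .set("spark.storage.memoryFraction", "0.1")  // total storage memory (Spark 1.x)
  .set("spark.storage.unrollFraction", "0.05") // slice of it reserved for unrolling

val ssc = new StreamingContext(conf, Seconds(1))

// MEMORY_ONLY keeps blocks deserialized in memory, so putIterator goes
// through the MemoryStore unroll path rather than the disk/serialized paths.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)
lines.count().print() // compare this count with the record count on the UI

ssc.start()
ssc.awaitTermination()
```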
Let me know what you think.