Github user ksakellis commented on the pull request:

    https://github.com/apache/spark/pull/4067#issuecomment-72498987
  
    @pwendell I'm not sure how we can do what you propose without an 
O(n) loop through all the records before passing them to the 
InterruptibleIterator. We could do something fancy like counting 
incrementally and, when the task finishes, looping over any remaining 
unread records to count them, but I don't think the complication is worth 
it. Also, I think reporting the accurate number of records read is better. 
    
    Alternatively, we could make bytesRead more accurate. Right now it is 
computed in ShuffleBlockFetcherIterator based on the blocks fetched. Since 
we flatMap over that iterator in BlockStoreShuffleFetcher, we report that 
all the bytes were read even if they weren't. We could move the bytesRead 
bookkeeping out of ShuffleBlockFetcherIterator into the same iterator that 
counts the records read, so the two metrics line up and are both 
accurate. 
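    To illustrate the idea (a minimal sketch, not Spark's actual 
implementation -- `MeteredIterator` and `sizeOf` are hypothetical names), 
both metrics could be maintained by one wrapping iterator that updates 
them only as records are actually consumed, so an early-terminated task 
reports only what it really read:

    ```scala
    // Hypothetical sketch: count records and bytes in the same iterator,
    // incrementing only when a record is actually pulled by the consumer.
    class MeteredIterator[T](underlying: Iterator[T], sizeOf: T => Long)
        extends Iterator[T] {
      private var _recordsRead = 0L
      private var _bytesRead = 0L
      def recordsRead: Long = _recordsRead
      def bytesRead: Long = _bytesRead

      override def hasNext: Boolean = underlying.hasNext
      override def next(): T = {
        val record = underlying.next()
        _recordsRead += 1
        _bytesRead += sizeOf(record)   // bytes accrue with the records
        record
      }
    }

    // If the task stops after two of three records, only those two
    // (and their bytes) are reported.
    val it = new MeteredIterator(Iterator("a", "bb", "ccc"),
                                 (s: String) => s.length.toLong)
    it.next(); it.next()
    println(it.recordsRead) // 2
    println(it.bytesRead)   // 3
    ```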

