Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/7536#issuecomment-137443861
  
    On the Hadoop side, getLength() is part of the required interface, so the 
function would be there.  With all the input formats I've seen it is always 
set to something reasonable, but anyone can write a custom input format.  I 
could see cases where someone has an input format where they don't know the size, 
or where it's more expensive to compute the size than just to fetch it.   You could 
simply add a check for 0 and fall back to the # of partitions.
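    The 0-check fallback could be sketched like this (a minimal, hypothetical sketch: `Split` stands in for Hadoop's `InputSplit`, which a custom input format may implement with `getLength()` returning 0, and `effectiveWeight` is an invented helper, not code from this PR):

```java
import java.util.Arrays;
import java.util.List;

public class CoalesceSizing {
    // Stand-in for Hadoop's InputSplit: getLength() is part of the
    // required interface, but a custom format may report 0.
    interface Split { long getLength(); }

    // Hypothetical helper: weight a set of partitions by their reported
    // total size, falling back to the plain partition count when the
    // reported sizes are unusable (all zero).
    static long effectiveWeight(List<Split> splits) {
        long total = 0;
        for (Split s : splits) total += s.getLength();
        // check for 0 and fall back to the # of partitions
        return total > 0 ? total : splits.size();
    }

    public static void main(String[] args) {
        List<Split> skewed = Arrays.asList(() -> 100L, () -> 0L, () -> 900L);
        List<Split> unknown = Arrays.asList(() -> 0L, () -> 0L, () -> 0L);
        System.out.println(effectiveWeight(skewed));   // size-based weight
        System.out.println(effectiveWeight(unknown));  // falls back to count
    }
}
```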
    
    @watermen have you run this on a real cluster with skewed data to see if it 
makes a difference?  What input formats have you used?
    
    If there are thousands (or tens of thousands) of partitions and you are 
coalescing into a small # of buckets, we are now potentially calculating the 
length in every group over and over again.  Did you test to see how long that 
takes vs just checking the size of the array?  I'm guessing that isn't too bad, 
but it doesn't hurt to verify.
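    One way to avoid recomputing the length in every group would be to sum the split lengths once up front and answer each group from a cached prefix-sum array; this only illustrates the caching idea and is not this PR's code (`prefixSums` and the example lengths are invented):

```java
import java.util.stream.LongStream;

public class CachedLengths {
    // Compute cumulative lengths once, so any contiguous group's total
    // is a single subtraction instead of a fresh sum over its splits.
    static long[] prefixSums(long[] lengths) {
        long[] prefix = new long[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            prefix[i + 1] = prefix[i] + lengths[i];
        }
        return prefix;
    }

    public static void main(String[] args) {
        // Hypothetical split lengths for 10,000 partitions: 1, 2, ..., 10000 bytes.
        long[] lengths = LongStream.rangeClosed(1, 10_000).toArray();
        long[] prefix = prefixSums(lengths);
        // Total bytes in the group covering partitions [100, 200): one subtraction.
        System.out.println(prefix[200] - prefix[100]); // prints 15050
    }
}
```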


