GitHub user seyfe commented on the issue:

    https://github.com/apache/spark/pull/15371
  
    Hi @srowen, 
    
    This PR doesn't introduce any extra data copy operations. It moves the existing data-copy code from the `JsonProtocol.accumValueToJson` method to the `BlockStatusesAccumulator.value` method, so that the copy happens inside the synchronized block, roughly the shape sketched below.
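    
    To make that shape concrete, here is a minimal, hypothetical sketch (the element type and field names are simplified placeholders, not the actual Spark code): writers append through a `Collections.synchronizedList`, and `value` takes its snapshot while holding the list's monitor.
    
    ```scala
    import java.util.{ArrayList, Collections, List => JList}
    
    import scala.collection.JavaConverters._
    
    // Hypothetical simplification of the accumulator: add() goes through a
    // Collections.synchronizedList, and value() copies the contents while
    // holding the list's monitor, so a reader never iterates concurrently
    // with a writer's add().
    class BlockStatusesAccumulator[T] {
      private val seq: JList[T] = Collections.synchronizedList(new ArrayList[T]())
    
      def add(v: T): Unit = seq.add(v)
    
      // The copy happens inside the synchronized block; callers such as the
      // JSON serializer then operate on an immutable snapshot.
      def value: Seq[T] = seq.synchronized {
        seq.asScala.toList
      }
    }
    ```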
    
    But let me try to answer your question. I checked the data size using 3 different pipelines. 99% of the time the ArrayList has fewer than 4 items. There was one case where it maxed out at 4000 items, but that happened less than 1% of the time. I ran my test with 4000 executors, which I think is why the 4000 number came up.
    
    I debated other options as well. Moving the JSON serialization into `BlockStatusesAccumulator` would also work, but I didn't want to mix the responsibilities of these classes. I also considered introducing Akka actors for thread safety, but I think that would be overkill in this scenario since the data size is small. That is why I proposed this solution, but I would love to hear your suggestions.
    
    I don't know the answer to the second part of your question (quoted below), but the existing behavior is not changed. The only change is that we now convert the ArrayList to a Scala List inside a synchronized block, so we won't get a `ConcurrentModificationException` (a minimal reproduction of that race is sketched after the quote). If we do need to rely on observing changes, I think we need a bigger change. @zsxwing, would you please comment on that part?
    >  nothing is actually relying on observing changes to the underlying data
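    
    For context, here is a minimal, self-contained sketch of the race the synchronized copy guards against (the names are illustrative, not Spark code): iterating a `Collections.synchronizedList` without holding its lock while another thread appends can throw `ConcurrentModificationException`, whereas snapshotting inside the lock is safe.
    
    ```scala
    import java.util.{ArrayList, Collections}
    
    import scala.collection.JavaConverters._
    
    object CmeRaceDemo {
      def main(args: Array[String]): Unit = {
        val list = Collections.synchronizedList(new ArrayList[Int]())
    
        // A writer thread keeps appending, like tasks updating the accumulator.
        val writer = new Thread(new Runnable {
          override def run(): Unit = (1 to 1000000).foreach(i => list.add(i))
        })
        writer.start()
    
        // Unsafe: iterating without the lock can fail intermittently with
        // ConcurrentModificationException while the writer is appending:
        //   val snapshot = list.asScala.toList
    
        // Safe: take the snapshot while holding the list's monitor, which is
        // what moving the copy into the accumulator's value method achieves.
        val snapshot = list.synchronized { list.asScala.toList }
    
        writer.join()
        println(s"snapshot size: ${snapshot.size}, final size: ${list.size}")
      }
    }
    ```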

