davidnavas commented on issue #24616: [SPARK-27726] [Core] Fix performance of 
ElementTrackingStore deletes when using InMemoryStore under high loads
URL: https://github.com/apache/spark/pull/24616#issuecomment-493171343
 
 
   Updated PR
   >>>
   At some point I thought about allowing "unordered views", which I think 
would also fix the main performance problem with the deletes (which is the copy 
+ sort operation). But I kinda like your new interface (bar some naming nits).
   <<<
   
   Yes, that's where I started too -- let me share with you why I ended up with 
removeAll anyway...
   
   Setup:
   Define N as the number of entities for a particular type
   Define MAX as the maximum number of entities that the ElementStore is tasked 
to retain  (iirc, for stages, this is 1000)
   Define K as the number of elements being deleted in one go.
   
   So, first thing I did was to remove the sort if first==last on the view, 
which isn't correct in general, but certainly surfaces an unordered view.  The 
problem is that for N >> MAX, K ~= N, and the basic algorithm is still O(N^2) 
because we're filtering the whole list on each pass.
   
   Having got my head on straight, the second thing I did was to move the 
entire delete out of the stage loop.  Then I noticed that you had already done 
that for Tasks.  But I was annoyed by the implied sort that happens anyway, so 
I implemented a removeIf construct (attempt unordered view part 2).  This held 
up really well, until I performance tested with LevelDB.
   
   LevelDB works GREAT with the original implementation.  It's not so wonderful 
when N << MAX, because then the delete is O(N) instead of O(K).  I hate having 
to explain a performance regression on a performance PR, and that's why I moved 
to removeAll and separate implementations for each Store.  It's actually about 
half the speed of the removeIf construct for InMemory, probably because of all 
the wrappers being built around the indexValues (stage naturalKeys).  And of 
course, InMemoryStore still is O(N) and not O(K) on removal for N << MAX.  But 
it's 100x faster than LevelDB, and isn't anyway a regression in behavior.
   
   removeAll was definitely not my first go-to.  :>
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to