gianm commented on issue #8125: Leave a single CachePopulator and choose 
sync/async mode automatically
URL: 
https://github.com/apache/incubator-druid/issues/8125#issuecomment-513971684
 
 
   Maybe it's just me, but I find foreground/background and sync/async to be 
equally understandable.
   
   > There could be a single implementation. The executor may be 
DirectExecutorService in the "sync" case.
   
   A note: at least currently, the control flow is different enough that the foreground populator cannot be implemented as a background populator with a 
direct executor service. I think the main reason is that the background 
populator isn't only trying to do the _population_ in the background, it's also 
trying to do _computation of cache entries_ in the background _and in parallel_.
   
   The latter point means that it needs to take a different approach from the foreground one. In particular, the foreground populator is simple: it writes 
something to a ByteArrayOutputStream for each item as the Sequence is walked. 
The background populator instead writes these items into a List and then starts 
up separate thread pool tasks to serialize each one, and then also populates 
the cache using yet another task.
   
   In other words, if the foreground populator were replaced with a background populator + direct executor, it would use a lot more memory than it does today, since it would need to buffer up a List of items that it currently just writes immediately to a ByteArrayOutputStream.
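
   To make the memory difference concrete, here's a minimal sketch of the two control flows. The class and method names are my own for illustration, not the actual Druid ones: the foreground style streams each serialized item into one growing buffer, while the background style must retain the whole item list until the pool tasks have serialized it.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative names only, not the actual Druid classes.
class PopulatorSketch
{
  // Foreground style: serialize each item as the Sequence is walked,
  // appending directly to a single growing buffer; no item buffering.
  static byte[] populateForeground(List<String> items, Function<String, byte[]> serialize)
  {
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    for (String item : items) {
      baos.writeBytes(serialize.apply(item));
    }
    return baos.toByteArray();
  }

  // Background style: the whole List of items must be retained while pool
  // tasks serialize the entries, even if the executor happens to be direct.
  static byte[] populateBackground(List<String> items, Function<String, byte[]> serialize, ExecutorService exec)
  {
    final List<Future<byte[]>> futures = new ArrayList<>();
    for (String item : items) {
      futures.add(exec.submit(() -> serialize.apply(item)));
    }
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    for (Future<byte[]> f : futures) {
      try {
        baos.writeBytes(f.get());
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    return baos.toByteArray();
  }
}
```

   Swapping in a direct executor makes the second method run synchronously, but the intermediate `List` of items (and `List` of futures) still exists, which is the extra memory cost described above.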
   
   > It seems to me that the crucial difference between sync and async is 
whether blocking cache (Redis/Memcached) or local, (mostly) non-blocking cache 
(Caffeine) is used. At least, this should be highlighted in docs; at most, the 
mode of cachePopulator should be chosen automatically depending on the 
configured type of the cache.
   
   It's not necessarily that simple, for two reasons:
   
   1. Some remote cache implementations (like memcached) queue writes internally and return from `cache.put` _before_ the remote server acknowledges the write. So background population doesn't add any extra protection against network-based blocking.
   2. Even for local cache implementations, background population could be useful in theory, since it also does cache entry generation in the background. That generation could be expensive enough (CPU-wise) to be worth backgrounding.
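
   Point 1 can be illustrated with a toy `put` that enqueues and returns immediately, in the spirit of what such clients do internally (names are illustrative, not any real client's API):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy sketch: put() enqueues and returns before anything reaches the
// remote server; a real client would have a flusher draining the queue.
class QueueingCachePut
{
  private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<>();

  // Returns immediately; the write is not yet acknowledged (or even sent).
  void put(byte[] value)
  {
    pending.offer(value);
  }

  int pendingCount()
  {
    return pending.size();
  }
}
```

   Since `put` returns before the bytes ever hit the wire, calling it from the query thread is already effectively non-blocking, and moving the call to a background thread adds little.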
   
   > Does it make any sense to make async cache populator which pushes results 
to remote cache to have more than 1 thread? After all, we will be hitting the 
same bottleneck - the local NIC. So async populator may just have a queue and a 
single thread takes cache entries from the queue and tries to push them to 
remote cache.
   
   It depends how exactly the Cache impl works. If its `put` method is blocking, then having more than one thread is useful, since in that case you can get more bandwidth by issuing multiple requests in parallel over separate connections (otherwise, you have too much 'dead air' while waiting for responses). If its 
`put` method is CPU intensive (maybe it must wrap the provided `byte[]` in an 
envelope of sorts, involving a lot of memory copies) then having more than one 
thread could also improve throughput (but maybe a bad idea, since it's scary if 
you're spending so much CPU on putting stuff into the query cache). Otherwise, 
it's probably not useful to have more than one thread.
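
   As a rough sketch of the queue-plus-workers idea under discussion (hypothetical names, not Druid's actual classes): with a blocking `put`, N worker threads keep N requests in flight at once, which is exactly where extra threads buy bandwidth.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical queue-plus-workers populator. With a blocking put(),
// numThreads workers keep that many requests in flight at once.
class QueuedPopulator
{
  private final BlockingQueue<byte[]> queue;

  QueuedPopulator(int numThreads, int capacity, Consumer<byte[]> blockingPut)
  {
    queue = new ArrayBlockingQueue<>(capacity);
    for (int i = 0; i < numThreads; i++) {
      final Thread worker = new Thread(() -> {
        try {
          while (true) {
            blockingPut.accept(queue.take()); // one in-flight put per worker
          }
        } catch (InterruptedException ignored) {
          // worker exits on interrupt
        }
      });
      worker.setDaemon(true);
      worker.start();
    }
  }

  // Drop the entry on overflow rather than block the query path.
  boolean offer(byte[] entry)
  {
    return queue.offer(entry);
  }
}
```

   With `numThreads = 1` this is the single-thread design proposed above; it's only worth raising when `blockingPut` actually waits on the network or burns CPU.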
