Dear GHC Devs,

I realize I should seek your help regarding my current situation.
TL;DR: the most serious suffering is this: after the heap is loaded with many TVars forming cyclic structures, GC dominates my CPU utilization with little business progress being made. The nonmoving GC (`-xn`, new in 8.10.1) helps, but its ceiling is not high enough for my case: obvious performance degradation starts at about ~350MB RSS, and the process becomes unusable as RSS approaches 1GB, while I expect a GHC-compiled process to serve 20~30GB of in-memory data practically okay.

https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132556.html contains the most interesting conversations so far.

I found https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html highly relevant: they managed to keep 100GB per instance. But switching to immutable data structures residing within compact regions is not immediately feasible for my case, as the schema would have to be re-designed, though I have had my colleagues start evaluating that option.

Best regards,
Compl

> Begin forwarded message:
>
> From: YueCompl via Haskell-Cafe <haskell-c...@haskell.org>
> Subject: [Haskell-cafe] For STM to practically serve large in-mem datasets with cyclic structures WAS: STM friendly TreeMap (or similar with range scan api)? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
> Date: 2020-07-30 at 21:28:31 GMT+8
> To: Haskell Cafe <haskell-c...@haskell.org>
> Reply-To: YueCompl <compl....@icloud.com>
>
> For the record, the overhead of STM over IO-based concurrency control (or other means where manual composition of transactions is needed) is a price I'm willing to pay in my use case, as the work is not machine-performance critical: it distributes input data plus parameters to a cluster of worker nodes and collects their results into permanent storage or a data pipeline. But professionally crafting well-synced, race-free scheduling code by hand is barely affordable for my org: the shapes of our datasets, the relationships between them, and the algorithms processing them all vary at a fast pace, and we have difficulty, or lack the willingness, to hire a workforce specifically to keep each new data pipeline race-free. Race-free it has to be, but preferably at the cost of machine-hours instead of human head count.
>
> Meanwhile, easy composition of STM code, wrapped in scriptable procedures, will let our analysts author the scheduling scripts without too many concerns. Our programmers can then focus on the performance-critical parts of the data processing code, such as optimizing tight loops.
>
> Outside of the tight loops, I think it is acceptable to us for an STM solution to be up to 2~3 orders of magnitude slower than its best rivals, as long as it scales. For a (maybe cheating) example: if fully optimized code can return a result 10 ms after an analyst clicks a button, we don't mind if the unoptimized STM script needs 10 seconds, so long as the result is correct.
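> To make the kind of composition meant here concrete, a minimal sketch (the transfer/rebalance names are purely illustrative, not from our codebase):
>
>     import Control.Concurrent.STM
>
>     -- A primitive action, written once as a composable STM fragment:
>     transfer :: TVar Int -> TVar Int -> Int -> STM ()
>     transfer from to n = do
>       modifyTVar' from (subtract n)
>       modifyTVar' to   (+ n)
>
>     -- Scripts glue fragments together; the combined block still
>     -- commits atomically, with no hand-written locking:
>     rebalance :: TVar Int -> TVar Int -> TVar Int -> IO ()
>     rebalance a b c = atomically $ do
>       transfer a b 10
>       transfer b c 5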
> As a philosophical aside: I've heard that AT&T had UNIX designed specifically for their control plane, while their data plane ran separate software (on separate hardware, obviously). Modern systems have powerful CPUs that tempt us to squeeze ever more performance out of them, and SIMD instructions make that even more tempting. I think we'd better resist that temptation when programming something that belongs to the control plane per se, and indulge it only when programming something that belongs to the data plane. And it appears data-plane programs are being shifted to GPUs nowadays, which feels right.
>
> Regards,
> Compl
>
>
>> On 2020-07-30, at 20:10, YueCompl via Haskell-Cafe <haskell-c...@haskell.org> wrote:
>>
>> Hi Peter,
>>
>> Great to hear from you!
>>
>> For the record, tskiplist (together with stm-containers) did improve my situation a great deal with respect to scalability under concurrency/parallelism!
>>
>> I'm still far from the stage of squeezing out the last drops of performance; currently I'm just making sure performance concerns stay reasonable while I PoC the correctness and ergonomics of my HPC architecture (an in-memory graph plus out-of-core (mmap) array DBMS powered computation cluster, with shared storage). Now that parallelism appears acceptable, I seem to suffer from a serious GC issue when scaling up the process's working memory size. I suspect it's caused by the many added TVars and/or the aggressively circular structures they form in my case, and so far I cannot find a way to overcome it.
>>
>> Thanks for your detailed information!
>>
>> Best regards,
>> Compl
>>
>>
>>> On 2020-07-30, at 19:19, Peter Robinson <p...@lowerbound.io> wrote:
>>>
>>> Hi Compl,
>>>
>>> > > + This package provides a proof-of-concept implementation of a skip list in STM
>>> >
>>> > This has to mean something but I can't figure out yet.
>>> >
>>> > Dear Peter Robinson, I hope you can see this message and get in the loop of the discussion.
>>>
>>> The reason for adding this sentence was that tskiplist hasn't been optimized for production use. Later on, I wrote an implementation of a concurrent skip list with atomic operations that performs significantly better, but its operations work in the IO monad.
>>>
>>> I'm surprised to hear that you're getting poor performance even when using the stm-containers package, which I believe was meant to be used in production. A while ago, I ran some benchmarks comparing concurrent dictionary data structures (such as stm-containers) under various workloads. While STMContainers.Map wasn't as fast as the concurrent-hashtable package, the results indicate that its performance doesn't degrade too much under larger workloads.
>>>
>>> You can find the benchmark results here (10^6 randomly generated insertion/deletion/lookup requests distributed among 32 threads):
>>> https://lowerbound.io/blog/bench2-32.html
>>> And some explanations of the benchmarks are here:
>>> https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html
>>>
>>> One issue that I came across when implementing the tskiplist package was this: if a thread wants to insert some item into the skip list, it needs to search for the entry point by performing readTVar operations starting at the list head.
>>> So, on average, a thread will read O(log n) TVars (assuming a skip list of n items) and, if any of these O(log n) TVars is modified by a simultaneously running thread, the STM runtime will observe a (false) conflict and rerun the transaction. It's not clear to me how to resolve this issue without access to something like unreadTVar (see [1]).
>>>
>>> Best,
>>> Peter
>>>
>>> [1] UnreadTVar: Extending Haskell Software Transactional Memory for Performance (2007), by Nehir Sonmez, Cristian Perfumo, Srdjan Stipic, Adrian Cristal, Osman S. Unsal, and Mateo Valero.
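>>> To illustrate the pattern just described, here is a minimal sketch using a plain linked list of TVars in place of the actual skip list (this is not the tskiplist code, just the shape of the problem):
>>>
>>>     import Control.Concurrent.STM
>>>
>>>     -- A singly linked chain of TVars, standing in for one search
>>>     -- path through the skip list:
>>>     data Node = Node Int (TVar (Maybe Node))
>>>
>>>     -- Every readTVar along the walk enters the transaction's read
>>>     -- set, so a concurrent write to ANY node on the path forces a
>>>     -- rerun of the whole transaction, even when the final answer
>>>     -- would have been unaffected (the "false" conflict above).
>>>     contains :: Node -> Int -> STM Bool
>>>     contains (Node k next) x
>>>       | k == x    = pure True
>>>       | otherwise = readTVar next >>= maybe (pure False) (`contains` x)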