And I also have some uneducated guesses, posted in
https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132559.html

> Per my understanding, GHC's GC doesn't search for free segments within a 
> heap; instead it copies all live objects to a new heap and then swaps the 
> new heap in as the live one. So I assume memory (address-space) 
> fragmentation doesn't trouble a GHC process as much as it does other 
> runtimes.

> I suspect the difficulty resides in detecting circular/cyclic relationships 
> among live data structures within the old heap, especially cycles formed 
> through arbitrary numbers of levels of pointer indirection. If the GC had 
> to perform some dict lookup to decide whether an object has already been 
> copied to the new heap, that would be O(n*log(n)) complexity in the best 
> case, where n is the number of live objects in the heap.

> To copy circular structures efficiently, one optimization I can imagine is 
> to have a `new ptr` field in every heap object. Then, when copying some 
> other object that holds a pointer to this one, the `new ptr` can be read 
> out: if it is not nil, assign that value to the pointer field of the other 
> object's copy in the new heap, and it's done; otherwise, copy this object 
> to the new heap first and record its new address in the field. But I don't 
> know the details of GHC's GC and can't judge the feasibility of this 
> technique. And even the new nonmoving GC may have similar difficulty 
> escaping a cycle when following pointers.
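For what it's worth, the `new ptr` idea sketched in the quote is essentially the forwarding-pointer technique that copying collectors (including, as I understand it, GHC's) use: each evacuated object leaves its new address behind in place, so cycles and sharing need no dictionary at all. Below is a minimal simulation in plain Haskell, with heap objects played by records of IORefs; all names are invented for illustration, and this is a sketch of the technique, not GHC's actual implementation.

```haskell
import Data.IORef

-- A toy "heap object": a payload plus outgoing pointers, with a
-- forwarding field used only during copying (Nothing = not yet copied).
data Obj = Obj
  { payload :: Int
  , ptrs    :: IORef [Obj]
  , fwd     :: IORef (Maybe Obj)
  }

newObj :: Int -> IO Obj
newObj x = Obj x <$> newIORef [] <*> newIORef Nothing

-- Copy an object graph, preserving sharing and cycles, without any
-- auxiliary dictionary: the forwarding field itself is the lookup.
copy :: Obj -> IO Obj
copy o = do
  f <- readIORef (fwd o)
  case f of
    Just o' -> pure o'              -- already evacuated: follow forwarding ptr
    Nothing -> do
      o' <- newObj (payload o)
      writeIORef (fwd o) (Just o')  -- install forwarding ptr BEFORE scanning,
                                    -- so cycles terminate
      readIORef (ptrs o) >>= mapM copy >>= writeIORef (ptrs o')
      pure o'

main :: IO ()
main = do
  a <- newObj 1
  b <- newObj 2
  writeIORef (ptrs a) [b]
  writeIORef (ptrs b) [a]           -- a cycle: a -> b -> a
  a' <- copy a
  [b']  <- readIORef (ptrs a')
  [a''] <- readIORef (ptrs b')
  -- sharing preserved: b's copy points back at a's copy, not a fresh object
  print (payload a', payload b', ptrs a'' == ptrs a')
```

Since the forwarding pointer is installed before the object's fields are scanned, the recursion bottoms out on any cycle in one visit per object, i.e. O(n) in the number of live objects rather than O(n log n).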

FYI

> On 2020-07-30, at 21:53, YueCompl via ghc-devs <ghc-devs@haskell.org> wrote:
> 
> Dear GHC Devs,
> 
> I realize I should seek your help regarding my current situation.
> 
> TL;DR the most serious suffering is: after the heap is loaded with many 
> TVars forming cyclic structures, GC comes to dominate my CPU utilization 
> while little business progress is made.
> 
> The nonmoving GC (`-xn`, GHC 8.10.1) helps, but its ceiling is not high 
> enough for my case: obvious performance degradation starts at about ~350MB 
> RSS, and the process becomes unusable as RSS approaches 1GB, while I expect 
> a GHC-compiled process to serve 20~30GB of in-memory data practically okay.
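To make the workload concrete, here is a toy reduction (all type and function names made up) of the kind of heap in question: a cyclic graph of mutable TVar links. Scaled up to millions of nodes and run with `+RTS -s` (optionally `-xn` for the nonmoving collector on GHC 8.10.1+), it lets one watch GC figures grow on such a heap; the small version below only demonstrates building and traversing the cycle.

```haskell
import Control.Monad (forM, forM_)
import Control.Concurrent.STM

-- One graph node: a value plus a TVar link, so the structure is
-- mutable and cyclic, like the heap described above.
data Node = Node { val :: Int, next :: TVar (Maybe Node) }

-- Build a ring of n nodes: node i points to node (i+1) `mod` n,
-- so the last node closes the cycle back to the first.
mkRing :: Int -> IO [Node]
mkRing n = atomically $ do
  ns <- forM [0 .. n - 1] $ \i -> Node i <$> newTVar Nothing
  forM_ (zip ns (tail ns ++ [head ns])) $ \(a, b) ->
    writeTVar (next a) (Just b)
  pure ns

-- Walk k steps around the ring, summing values, in one transaction.
walk :: Int -> Node -> IO Int
walk k0 n0 = atomically (go k0 n0 0)
  where
    go 0 _ acc = pure acc
    go k n acc = do
      mn <- readTVar (next n)
      case mn of
        Nothing -> pure acc               -- broken link: stop (cannot happen here)
        Just n' -> go (k - 1) n' (acc + val n)

main :: IO ()
main = do
  ns <- mkRing 3
  s  <- walk 6 (head ns)   -- two full laps around the 3-node ring
  print s
```

Every node is reachable from every other, so no node can be collected while any is live, and each minor GC must chase the whole cycle of mutated TVars.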
> 
> https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132556.html 
> contains the most interesting parts of the conversation.
> 
> I found 
> https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html
> quite relevant: they managed to keep 100GB per instance, but switching to 
> immutable data structures residing within compact regions is not 
> immediately feasible for my case, as the schema would have to be 
> redesigned, though I have colleagues evaluating that option.
> 
> Best regards,
> Compl
> 
> 
>> Begin forwarded message:
>> 
>> From: YueCompl via Haskell-Cafe <haskell-c...@haskell.org>
>> Subject: [Haskell-cafe] For STM to practically serve large in-mem datasets 
>> with cyclic structures WAS: STM friendly TreeMap (or similar with range scan 
>> api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM 
>> threads, with hot TVar updates?
>> Date: 2020-07-30 at 21:28:31 GMT+8
>> To: Haskell Cafe <haskell-c...@haskell.org>
>> Reply-To: YueCompl <compl....@icloud.com>
>> 
>> For the record, the overhead of STM over IO-based concurrency control (or 
>> other means where transactions must be composed manually) is a price I'm 
>> willing to pay in my use case, as it is not machine-performance critical: 
>> distributing input data and parameters to a cluster of worker nodes, then 
>> collecting their results into permanent storage or a data pipeline. But 
>> professionally crafting well-synchronized, race-free scheduling code by 
>> hand is barely affordable for my org. The shapes of datasets, the 
>> relationships between them, and the algorithms processing them vary at a 
>> fast pace, and we lack the capacity, or the willingness, to hire staff 
>> specifically to keep each new data pipeline race free. Race free it has to 
>> be, but preferably at the cost of machine hours rather than human head 
>> count.
>> 
>> Meanwhile, easily composable STM code, wrapped in scriptable procedures, 
>> will let our analysts author the scheduling scripts without too many 
>> concerns, while our programmers focus on the performance-critical parts of 
>> the data processing code, such as optimizing tight loops.
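The composability being paid for here is STM's classic selling point: independently written transactions combine into one atomic, race-free whole with no extra locking discipline. A standard textbook-style sketch (not code from the project, names invented):

```haskell
import Control.Concurrent.STM

type Account = TVar Int

-- Two transactions written independently of each other.
withdraw, deposit :: Account -> Int -> STM ()
withdraw acc n = modifyTVar' acc (subtract n)
deposit  acc n = modifyTVar' acc (+ n)

-- Composition: the two steps commit together or not at all. No analyst
-- writing a script on top of this has to reason about lock ordering or
-- interleavings between the two accounts.
transfer :: Account -> Account -> Int -> STM ()
transfer from to n = withdraw from n >> deposit to n

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  (,) <$> readTVarIO a <*> readTVarIO b >>= print
```

Achieving the same guarantee with locks requires a global lock-acquisition discipline across every pipeline, which is exactly the per-pipeline engineering cost being avoided.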
>> 
>> As long as it stays out of the tight loops, we find it acceptable for an 
>> STM solution to be up to 2~3 orders of magnitude slower than its best 
>> rivals, provided it scales. For a (maybe cheating) example: if fully 
>> optimized code can return a result 10 ms after an analyst clicks a button, 
>> we don't mind an unoptimized STM script needing 10 seconds, so long as the 
>> result is correct.
>> 
>> On a more philosophical note: I've heard that AT&T had UNIX designed 
>> specifically for their control plane, while their data plane ran separate 
>> software (on separate hardware, obviously). Modern systems have powerful 
>> CPUs tempting us to squeeze ever more performance out of them, and SIMD 
>> instructions make it even more tempting; I think we'd better resist that 
>> temptation when programming something that belongs to the control plane 
>> per se, and indulge it only when programming something that belongs to the 
>> data plane. And data plane programs appear to be shifting to GPUs 
>> nowadays, which feels right.
>> 
>> Regards,
>> Compl
>> 
>> 
>>> On 2020-07-30, at 20:10, YueCompl via Haskell-Cafe 
>>> <haskell-c...@haskell.org> wrote:
>>> 
>>> Hi Peter,
>>> 
>>> Great to hear from you! 
>>> 
>>> For the record, tskiplist (together with stm-containers) greatly improved 
>>> my situation with respect to scalability under concurrency/parallelism!
>>> 
>>> I'm still far from the stage of squeezing out the last drops of 
>>> performance; currently I'm just making sure performance concerns stay 
>>> reasonable while my PoC establishes the correctness and ergonomics of my 
>>> HPC architecture (an in-memory graph + out-of-core (mmap) array DBMS 
>>> powered computation cluster, with shared storage). Now that parallelism 
>>> appears acceptable, I seem to suffer from a serious GC issue when scaling 
>>> up the process's working memory size. I suspect it's caused by the many 
>>> added TVars and/or the aggressively circular structures among them in my 
>>> case, and so far I cannot find a way to overcome it.
>>> 
>>> Thanks for your detailed information!
>>> 
>>> Best regards,
>>> Compl
>>> 
>>> 
>>>> On 2020-07-30, at 19:19, Peter Robinson <p...@lowerbound.io> wrote:
>>>> 
>>>> Hi Compl,
>>>> >+ This package provides a proof-of-concept implementation of a skip list 
>>>> >in STM
>>>> 
>>>> This has to mean something, but I can't figure out what yet.
>>>> 
>>>> Dear Peter Robinson, I hope you can see this message and join the 
>>>> discussion.
>>>> 
>>>> 
>>>> The reason for adding that sentence was that tskiplist hasn't been 
>>>> optimized for production use. Later on, I wrote an implementation of a 
>>>> concurrent skip list with atomic operations that performs significantly 
>>>> better, but its operations work in the IO monad. 
>>>> 
>>>> I'm surprised to hear that you're getting poor performance even when 
>>>> using the stm-containers package, which I believe is meant to be used in 
>>>> production. A while ago, I ran some benchmarks comparing concurrent 
>>>> dictionary data structures (such as STMContainers.Map) under various 
>>>> workloads. While STMContainers.Map wasn't as fast as the 
>>>> concurrent-hashtable package, the results indicate that its performance 
>>>> doesn't degrade too much under larger workloads.
>>>> 
>>>> You can find these benchmark results here (10^6 randomly generated 
>>>> insertion/deletion/lookup requests distributed among 32 threads): 
>>>> https://lowerbound.io/blog/bench2-32.html 
>>>> And some explanations about the benchmarks are here: 
>>>> https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html
>>>> One issue that I came across when implementing the tskiplist package was 
>>>> this: If a thread wants to insert some item into the skip list, it needs 
>>>> to search for the entry point by performing readTVar operations starting 
>>>> at the list head. So, on average, a thread will read O(log n) TVars 
>>>> (assuming a skip list of n items) and, if any of these O(log n) TVars are 
>>>> modified by a simultaneously running thread, the STM runtime will observe 
>>>> a (false) conflict and rerun the transaction. It's not clear to me how to 
>>>> resolve this issue without access to something like unreadTVar (see [1]).  
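The effect Peter describes can already be seen with a plain sorted list of TVars, the degenerate single-level skip list (a minimal stand-in sketch, not the actual tskiplist code): the search performs a `readTVar` on every cell along the path, and each of those reads lands in the transaction's read set.

```haskell
import Control.Concurrent.STM

-- A sorted linked list of TVars: the degenerate (single-level) skip list.
data List = Nil | Cons Int (TVar List)

-- The search reads every TVar on the path. Each readTVar enlarges the
-- transaction's read set, so a concurrent write to ANY of those cells
-- forces the STM runtime to rerun the whole search, even when the write
-- cannot affect the result -- the "false conflict" discussed above.
member :: Int -> TVar List -> STM Bool
member x tv = do
  l <- readTVar tv
  case l of
    Nil -> pure False
    Cons y tv'
      | y == x    -> pure True
      | y >  x    -> pure False   -- sorted: we have passed x's position
      | otherwise -> member x tv'

-- Build the list from a sorted [Int].
fromList :: [Int] -> IO (TVar List)
fromList = foldr cons (newTVarIO Nil)
  where cons x mk = mk >>= newTVarIO . Cons x

main :: IO ()
main = do
  tv <- fromList [1, 3, 5]
  r1 <- atomically (member 3 tv)   -- reads the TVars holding 1 and 3
  r2 <- atomically (member 4 tv)   -- reads the TVars holding 1, 3 and 5
  print (r1, r2)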
>>>> 
>>>> Best,
>>>> Peter
>>>> 
>>>> [1] Nehir Sonmez, Cristian Perfumo, Srdjan Stipic, Adrian Cristal, Osman 
>>>> S. Unsal, Mateo Valero. UnreadTVar: Extending Haskell Software 
>>>> Transactional Memory for Performance (2007).
>>>> 
>>>> _______________________________________________
>>>> Haskell-Cafe mailing list
>>>> To (un)subscribe, modify options or view archives go to:
>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>>> Only members subscribed via the mailman list are allowed to post.
>>> 
>> 
> 
> _______________________________________________
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

