nickva commented on pull request #651: URL: https://github.com/apache/couchdb-documentation/pull/651#issuecomment-881764357
@kocolosk Good point. The locality trick would be useful internally, for example to process the changes feed for indexing, but it wouldn't help with write hotspots.

The design of the `_changes` feed external API is pretty neat, and I think it may be worth going that way eventually, though perhaps with an auto-sharding setup so that users don't have to think about Q at all.

I found a description of how the FDB backup system avoids hot write shards: https://forums.foundationdb.org/t/keyspace-partitions-performance/168/2. Apparently it writes to `(hash(version/1e6), version)` key ranges, which strikes a balance between still being able to query ranges and avoiding writing more than about 1 second of data to any one shard at a time on average (by default versions advance at a rate of about 1e6 per second). Not sure yet if that's an idea we can borrow directly, but perhaps there is something there (see the sketch below).

Regarding the changes feed being a bottleneck for indexing, we ran a quick and dirty test reading 1M and 10M changes on a busy cluster (3 storage nodes) and got about 58-64k rows/sec with just an empty accumulator that counts rows:

```
{ok, Db} = fabric2_db:open(<<"perf-test-user/put_insert_1626378013">>, []).
Fun = fun(_Change, Acc) -> {ok, Acc + 1} end.
([email protected])6> timer:tc(fun() -> fabric2_db:fold_changes(Db, 0, Fun, 0, [{limit, 1000000}]) end).
{16550135,{ok,1000000}}
([email protected])12> timer:tc(fun() -> fabric2_db:fold_changes(Db, 0, Fun, 0, [{limit, 10000000}]) end).
{156290604,{ok,10000000}}
....
```

That works out to roughly 1M rows in ~16.6 s and 10M rows in ~156 s. For indexing at least, that seems not too bad. We'd probably want to find a way to parallelize doc fetches and, most of all, to run index updates concurrently (a rough sketch of a parallel fetch is below).
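To make that keying scheme a bit more concrete, here's a rough sketch in Erlang of how such keys could be built. The module, function names, and the explicit bucket count are made up for illustration; this is only the shape of the idea, not FDB's actual backup code.

```
-module(version_buckets).
-export([key/2, bucket_ranges/3]).

%% Build a key for a mutation at a given FDB version. `Version div 1000000`
%% changes roughly once per second (versions advance ~1e6/sec), so each
%% second of writes hashes to a different bucket and, on average, lands on a
%% different shard.
key(Version, NumBuckets) ->
    Bucket = erlang:phash2(Version div 1000000, NumBuckets),
    {Bucket, Version}.

%% Reading versions [StartVersion, EndVersion) back in order requires one
%% range read per bucket, with the results merged by version afterwards.
bucket_ranges(StartVersion, EndVersion, NumBuckets) ->
    [{{B, StartVersion}, {B, EndVersion}} || B <- lists:seq(0, NumBuckets - 1)].
```

The trade-off is visible in `bucket_ranges/3`: spreading writes across buckets avoids a single hot shard, but any range read now has to touch every bucket and merge the results.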
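On the "parallelize doc fetches" point, here is a very rough sketch (not existing CouchDB code) of fetching the docs for a batch of changes with one process per doc. It assumes `fabric2_db:open_doc/2` as the fetch call and that each change from the fold carries the doc id under an `id` key; both of those, and whether a fabric2 Db handle can safely be shared across processes like this, are assumptions for illustration only.

```
-module(parallel_doc_fetch).
-export([fetch_batch/2]).

%% Fetch the docs for a batch of changes concurrently, one process per doc,
%% and return the results in the original batch order.
fetch_batch(Db, Changes) ->
    Parent = self(),
    Refs = [begin
        Ref = make_ref(),
        spawn_link(fun() ->
            #{id := DocId} = Change,
            Parent ! {Ref, catch fabric2_db:open_doc(Db, DocId)}
        end),
        Ref
    end || Change <- Changes],
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

In practice we'd probably want a bounded worker pool and to batch fetches inside fewer FDB transactions rather than spawning an unbounded number of processes, but it shows the general direction.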
