If you want to avoid overwhelming the cluster, there's an easy trick you can try.
In the RDBMS world, we frequently batch requests and issue 5,000 or 10,000 updates at once. We then put a delay of a second or so in the code after the commit so that the transaction log has a moment to catch up. This is really so that the checkpoint flushing dirty pages from memory doesn't run into I/O contention with the log writer.

Updating by time instead of key will incur a double I/O hit for the lookup, but you can easily control the order of updates. I say go for it.

---
Jeremiah Peschka, Managing Director, Brent Ozar PLF, LLC
Microsoft SQL Server MVP

On Feb 28, 2012, at 7:34 AM, Jeremy Raymond <[email protected]> wrote:

> Hi,
>
> I need to reindex a bucket with ~4 million items. If I do a
> streaming list-keys using the Erlang client and then read/write the
> items as the keys come in, it puts too much load on the cluster, and
> other MapReduce queries that run at the same time hit timeouts. I
> already have a date-based index on the items and was thinking of
> fetching items in hourly chunks and updating them in batches that way,
> as the items should be relatively evenly distributed in time. I could
> then better control the flow and load of the reindexing operations.
>
> Does anyone have better ideas, or use other strategies, when having
> to reindex?
>
> --
> Jeremy
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
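The batch-and-pause throttling described above can be sketched roughly as follows. This is a minimal, generic Python sketch, not Riak-specific code: `keys` and `update_item` are hypothetical stand-ins for however you fetch keys (e.g. per hourly index chunk) and rewrite a single item.

```python
import time

BATCH_SIZE = 5000      # issue a few thousand updates at a time
PAUSE_SECONDS = 1.0    # pause between batches so background I/O can catch up

def reindex_in_batches(keys, update_item, batch_size=BATCH_SIZE,
                       pause=PAUSE_SECONDS, sleep=time.sleep):
    """Apply update_item to every key, sleeping between fixed-size batches.

    Returns the number of batches processed. The `sleep` parameter is
    injectable so the throttle can be disabled or faked in tests.
    """
    batches = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            update_item(key)       # read/modify/write one item
        batches += 1
        sleep(pause)               # throttle: give the cluster room to breathe
    return batches
```

The same loop applies whether `keys` comes from a streamed key list or from hourly range queries against the date index; the point is simply that the pause bounds the sustained write rate so concurrent queries aren't starved.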
