Hi Boris,

Apart from not scaling quite as well as straight K/V access, emulating multiGET 
through MapReduce has another significant drawback. MapReduce has no concept of 
quorum reads and only works on a single copy of the data, which is essentially 
a read with R=1 that does not trigger read-repair. It can therefore return 
inconsistent or incorrect results if the replicas do not all hold the same 
data. It is worth noting that MapReduce was designed as a way to spread compute 
work efficiently across the cluster; repurposing it for bulk data retrieval is 
not what it was designed for.
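
To make the contrast concrete, here is a rough sketch of a plain GET with an 
explicit read quorum, using the Python client (exact option names vary between 
clients and versions, so treat the names here as illustrative):

    import riak

    client = riak.RiakClient(pb_port=8087)  # assumed local node; adjust host/port
    bucket = client.bucket('pages')         # hypothetical bucket name

    # An ordinary GET consults R replicas (here 2) and triggers read-repair
    # if they diverge; a MapReduce map phase reads only one vnode's copy,
    # which behaves like R=1 with no repair.
    obj = bucket.get('some-key', r=2)
    print(obj.data)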

The recommended way to implement an efficient multiget is to perform normal GET 
operations in parallel. If you are retrieving 20 objects, you do not 
necessarily need to issue all 20 GETs at once; you could instead spread them 
over perhaps 3 or 4 connections. If you then pair this with a connection pool 
that can grow and shrink in size (perhaps between a minimum and a maximum 
value) as load requires, you should be able to retrieve the objects in a 
reasonable time without overloading the cluster.
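
As a rough sketch of that approach (Python again, with a standard thread pool 
standing in for a real connection pool; the bucket name and port are 
assumptions):

    import concurrent.futures

    import riak

    client = riak.RiakClient(pb_port=8087)  # assumed local node; adjust host/port
    bucket = client.bucket('pages')         # hypothetical bucket name

    def multiget(keys, max_workers=4):
        # Issue ordinary GETs for the given keys, at most max_workers at a
        # time, so the normal R quorum and read-repair semantics apply and
        # the cluster never sees more than max_workers concurrent requests.
        with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
            objects = pool.map(bucket.get, keys)
            return {obj.key: obj.data for obj in objects}

    # One 20-key "page" fetched over 4 concurrent connections.
    page = multiget(['key-%d' % i for i in range(20)])

A real connection pool would also shrink when idle; ThreadPoolExecutor only 
grows up to max_workers, but it keeps the important property of bounding the 
concurrency the cluster sees.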

Best regards,

Christian


On 27 Feb 2013, at 02:18, Boris Okner <[email protected]> wrote:

> Thanks Christian,
> 
> The problem I'm trying to solve is how to retrieve values for a limited 
> number of keys with the best possible latency (or at least a decent latency 
> balanced with decent throughput). Let's say we have keys stored in some 
> cache on top of Riak, and want to retrieve values, 20 at a time, in order to 
> implement pagination. An alternative to mapreduce would be to send multiple 
> asynchronous gets, but then we'd have to worry about the connection pool 
> being exhausted if there are too many such "page" requests. So what would be 
> the proper way to deal with the situation when we need to emulate multiple 
> key retrieval?
> 
> On Tue, Feb 26, 2013 at 1:57 AM, Christian Dahlqvist <[email protected]> 
> wrote:
> Hi Boris,
> 
> MapReduce is a very flexible and powerful way of querying Riak and allows 
> processing to be performed locally where the data resides, which makes for 
> efficient processing of larger data sets. A consequence of this is that 
> every mapreduce job requires a covering set of vnodes (all vnodes that hold 
> the data required for processing) to participate, meaning that it puts 
> considerably more load on the system than straight K/V access and therefore 
> does not scale quite as well. It is primarily designed for batch-type 
> processing over reasonably large amounts of data and scales well with 
> increased data volumes as new nodes are added. However, we do not usually 
> recommend using it as an interface for realtime queries where low and 
> predictable latencies are required and the concurrency level, and therefore 
> the load level on the cluster, cannot be controlled.
> 
> I am not sure I understand what you mean by the performance degrading with 
> the number of nodes, unless you are strictly measuring latency rather than 
> throughput. As the number of nodes increases, it becomes more and more 
> likely that multiple physical nodes will be involved in the job, which adds 
> to the amount of communication and coordination required between the nodes, 
> thereby increasing latency. Could you please explain in more detail what you 
> are trying to achieve?
> 
> Best regards,
> 
> Christian 
> 
> 
> On 25 Feb 2013, at 16:41, Boris Okner <[email protected]> wrote:
> 
>> Hello,
>> 
>> I'm experimenting with 2 Riak 1.3.0 nodes (both are "bare metal"), and it 
>> looks like mapreduce performs better when one of the nodes is down. The 
>> mapreduce requests are run on 20-key blocks. So am I doing something wrong, 
>> or is this expected behaviour, i.e. does mapreduce degrade as the number of 
>> nodes increases? If the former, could you give me some pointers on how to 
>> set it up to take advantage of multiple nodes?
>> 
>> Thanks in advance for your help,
>> Boris
> 
> 

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
