> Out of curiosity, how are you planning on segmenting the data?
>
My plan for segmenting the data is to add a secondary index on a key called
seg_id (or something similar).
When I add an object to Riak, I will set seg_id to the first three
characters of the MD5 of the object's key, which should yield an even
distribution.
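For illustration, computing the segment id might look like this (a minimal
sketch; the function name and the encoding choice are mine):

import hashlib

def seg_id(key):
    # First three hex characters of the key's MD5 digest,
    # giving 16^3 = 4,096 possible segments.
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:3]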
Then, when querying the data, I will run map-reduce against each segment
separately (for 3 hexadecimal characters, that is 16^3 = 4,096 map-reduce
queries).
The inputs part of the query would look like this:
"inputs":{
"bucket":"mybucket",
"index":"seg_id_bin",
"key":"aaa"
}
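The full request would then be a POST to the /mapred endpoint, something like
the following (the JavaScript map phase is only a placeholder for whatever
extracts the field, and the "status" field name is a guess on my part):

curl -s -X POST http://localhost:8098/mapred \
  -H "Content-Type: application/json" \
  -d '{"inputs":{"bucket":"mybucket","index":"seg_id_bin","key":"aaa"},
       "query":[{"map":{"language":"javascript","source":
         "function(v){return [JSON.parse(v.values[0].data).status];}"}}]}'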
I would run the map-reduce queries in parallel.
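In Python, the fan-out could be sketched roughly like this (assuming the
requests library, the hypothetical "status" field from above, and a thread
count picked out of the air):

import json
import requests
from concurrent.futures import ThreadPoolExecutor

MAPRED_URL = "http://localhost:8098/mapred"
SEGMENTS = ["%03x" % i for i in range(16 ** 3)]  # "000" through "fff"

def run_segment(seg):
    # One exact-match 2i map-reduce per segment; the map phase is a
    # placeholder for whatever extracts the field of interest.
    query = {
        "inputs": {"bucket": "mybucket", "index": "seg_id_bin", "key": seg},
        "query": [{"map": {
            "language": "javascript",
            "source": "function(v){return [JSON.parse(v.values[0].data).status];}"
        }}]
    }
    r = requests.post(MAPRED_URL, data=json.dumps(query),
                      headers={"Content-Type": "application/json"})
    return r.json()

with ThreadPoolExecutor(max_workers=16) as pool:
    values = [v for seg_values in pool.map(run_segment, SEGMENTS)
              for v in seg_values]
print(len(values))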
It sounds like a lot of work just to get the value of one field, which makes
me think that there is a better way. Plus, I do not know whether this will
actually run as fast as I expect it to. That's why I'm asking here before I
implement it.
> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>
I am using the default Riak installation (with LevelDB as the backend and
search turned on). I am on a 16-core 3 GHz node with 20 GB of memory; however,
it appears that Riak is not using all of the resources available to it. I
suspect that this can be resolved by modifying the configuration.
That said, if you, or anyone reading this, could suggest a configuration that
is better suited to performing a relatively small batch operation across 900k
(and soon to be about 5 million) objects, that would be greatly appreciated.
Thanks!
- Jeff
On Apr 10, 2013, at 10:32 PM, Shuhao Wu <[email protected]> wrote:
> Out of curiosity, how are you planning on segmenting the data? Map reduce
> will execute over the entire data set.
>
> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>
> Shuhao
> Sent from my phone.
>
> On 2013-04-10 10:25 PM, "Jeff Peck" <[email protected]> wrote:
> As a follow-up to this thread and my thread from earlier today, I am
> basically looking for a simple way to extract the value of a single field
> (which happens to be indexed) from approximately 900,000 documents. I have
> tried many options, including a map-reduce function that executes entirely
> over HTTP (taking out any Python client bottlenecks). I let that run for
> over an hour before I stopped it. It did not return any output.
>
> I have also tried grabbing the list of 900k keys from a secondary index
> (very fast, about 11 seconds) and then fetching each key in parallel
> (using curl and GNU parallel). That was also too slow to be feasible.
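> For reference, the parallel fetch was along these lines (keys.txt holds the
> keys from the index query; the job count is arbitrary):
>
> $ cat keys.txt | parallel -j 16 "curl -s http://localhost:8098/buckets/mybucket/keys/{}"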
>
> Is there something basic that I am missing?
>
> One idea that I thought of was to have a secondary index that is intended to
> split all of my data into segments. I would use the first three characters of
> the MD5 of the document's key in hexadecimal format. So, the index would
> contain strings like "ae1", "2f4", "5ee", etc. Then, I can run my map-reduce
> query against *each* segment individually, and possibly even in parallel.
>
> I have observed that map-reduce is very fast with small sets of data (e.g.
> 5,000 objects), but with 900,000 objects it does not run proportionately
> fast. So, the idea is to divide the data into segments that can be better
> handled by map-reduce.
>
> Before I implement this, I want to ask: Does this seem like the appropriate
> way to handle this type of operation? And, is there any better way to do this
> in the current version of Riak?
>
>
> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <[email protected]> wrote:
>
>> There are some inefficiencies in the python client... I've been profiling it
>> recently and found that it occasionally takes the python client longer when
>> you're on the same machine.
>>
>> Perhaps Sean could comment?
>>
>> Shuhao
>> Sent from my phone.
>>
>> On 2013-04-10 4:04 PM, "Jeff Peck" <[email protected]> wrote:
>> Thanks Evan. I tried doing it in Python like this (realizing that the
>> previous way I did it used MapReduce) and had better results. It finished
>> in 3.5 minutes, which is better but still nowhere close to the 15 seconds
>> of the straight HTTP query:
>>
>> import riak
>> from pprint import pprint
>>
>> bucket_name = "mybucket"
>>
>> # Protocol Buffers transport on the default PB port (8087)
>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>> bucket = client.bucket(bucket_name)
>> # Exact-match secondary index lookup; returns the matching keys
>> results = bucket.get_index('status_bin', 'PERSISTED')
>>
>> print len(results)
>>
>>
>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan <[email protected]>
>> wrote:
>>
>> > get_index() is the right function there, I think.
>> >
>> > On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <[email protected]> wrote:
>> >> I can grab over 900,000 keys from an index, using an HTTP query, in
>> >> about 15 seconds, whereas the same operation in Python times out after
>> >> 5 minutes. Does this indicate that I am using the Python API
>> >> incorrectly? Should I fall back to a raw HTTP request when I need to
>> >> grab this many keys?
>> >>
>> >> (Note: This is tied to the question I asked earlier, but it is also a
>> >> general question to help me understand the proper usage of the Python
>> >> API.)
>> >>
>> >> Thanks! Examples are below.
>> >>
>> >> - Jeff
>> >>
>> >> ---
>> >>
>> >> HTTP:
>> >>
>> >> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED \
>> >>     | grep -o , | wc -l
>> >> 926047
>> >>
>> >> real 0m14.583s
>> >> user 0m2.500s
>> >> sys 0m0.270s
>> >>
>> >> ---
>> >>
>> >> Python:
>> >>
>> >> import riak
>> >>
>> >> bucket = "mybucket"
>> >> # HTTP transport on the default REST port (8098)
>> >> client = riak.RiakClient(port=8098)
>> >> results = client.index(bucket, 'status_bin',
>> >> 'PERSISTED').run(timeout=5*60*1000)  # 5 minute timeout
>> >> print len(results)
>> >>
>> >> (times out after 5 minutes)
>>
>>
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com