> Out of curiosity, how are you planning on segmenting the data?
>
My plan for segmenting the data is to add a secondary index on a key called
seg_id (or something similar).
When I add an object to Riak, I will set seg_id to the first three
characters of the MD5 of the object's key, which should yield an even
distribution.
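For illustration, computing the segment id might look like this (a minimal
sketch; the function name and the encoding choice are mine):

import hashlib

def seg_id(key):
    # First three hex characters of the key's MD5 digest,
    # giving 16^3 = 4,096 possible segments.
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:3]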
Then, when querying the data, I will run map-reduce against each segment
separately (for 3 hexadecimal characters, that is 16^3 = 4,096 map-reduce
queries).
The inputs part of the query would look like this:
"inputs":{
"bucket":"mybucket",
"index":"seg_id_bin",
"key":"aaa"
}
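The full request would then be a POST to the /mapred endpoint, something like
the following (the JavaScript map phase is only a placeholder for whatever
extracts the field, and the "status" field name is a guess on my part):

curl -s -X POST http://localhost:8098/mapred \
  -H "Content-Type: application/json" \
  -d '{"inputs":{"bucket":"mybucket","index":"seg_id_bin","key":"aaa"},
       "query":[{"map":{"language":"javascript","source":
         "function(v){return [JSON.parse(v.values[0].data).status];}"}}]}'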
I would run the map-reduce queries in parallel.
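In Python, the fan-out could be sketched roughly like this (assuming the
requests library, the hypothetical "status" field from above, and a thread
count picked out of the air):

import json
import requests
from concurrent.futures import ThreadPoolExecutor

MAPRED_URL = "http://localhost:8098/mapred"
SEGMENTS = ["%03x" % i for i in range(16 ** 3)]  # "000" through "fff"

def run_segment(seg):
    # One exact-match 2i map-reduce per segment; the map phase is a
    # placeholder for whatever extracts the field of interest.
    query = {
        "inputs": {"bucket": "mybucket", "index": "seg_id_bin", "key": seg},
        "query": [{"map": {
            "language": "javascript",
            "source": "function(v){return [JSON.parse(v.values[0].data).status];}"
        }}]
    }
    r = requests.post(MAPRED_URL, data=json.dumps(query),
                      headers={"Content-Type": "application/json"})
    return r.json()

with ThreadPoolExecutor(max_workers=16) as pool:
    values = [v for seg_values in pool.map(run_segment, SEGMENTS)
              for v in seg_values]
print(len(values))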
It sounds like a lot of work just to get the value of one field, which makes
me think that there is a better way. Plus, I do not know whether this will
actually run as fast as I expect it to. That's why I'm asking here before I
implement it.
> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>
I am using the default Riak installation (with LevelDB as the backend and
search turned on). I am on a 16-core 3 GHz node with 20 GB of memory; however,
it appears that Riak is not using all of the resources available to it. I
suspect that this can be resolved by modifying the configuration.
That said, if you, or anyone reading this, could suggest a configuration that
is better suited to performing a relatively small batch operation across 900k
(and soon to be about 5 million) objects, that would be greatly appreciated.
Thanks!
- Jeff
On Apr 10, 2013, at 10:32 PM, Shuhao Wu <[email protected]> wrote:
> Out of curiosity, how are you planning on segmenting the data? Map reduce
> will execute over the entire data set.
>
> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>
> Shuhao
> Sent from my phone.
>
> On 2013-04-10 10:25 PM, "Jeff Peck" <[email protected]> wrote:
> As a follow-up to this thread and my thread from earlier today, I am
> basically looking for a simple way to extract the value of a single field
> (which happens to be indexed) from approximately 900,000 documents. I have
> tried many options, including a map-reduce function that executes entirely
> over HTTP (taking out any Python client bottlenecks). I let that run for
> over an hour before I stopped it. It did not return any output.
>
> I have also tried grabbing the list of 900k keys from a secondary index
> (very fast, about 11 seconds) and then fetching each key in parallel
> (using curl and GNU parallel). That was also too slow to be feasible.
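> For reference, the parallel fetch was along these lines (keys.txt holds the
> keys from the index query; the job count is arbitrary):
>
> $ cat keys.txt | parallel -j 16 "curl -s http://localhost:8098/buckets/mybucket/keys/{}"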
>
> Is there something basic that I am missing?
>
> One idea that I thought of was to have a secondary index that is intended to
> split all of my data into segments. I would use the first three characters of
> the MD5 of the document's key in hexadecimal format. So, the index would
> contain strings like "ae1", "2f4", "5ee", etc. Then, I can run my map-reduce
> query against *each* segment individually, and possibly even in parallel.
>
> I have observed that map-reduce is very fast with small sets of data (e.g.
> 5,000 objects), but with 900,000 objects it does not run proportionately
> fast. So, the idea is to divide the data into segments that can be better
> handled by map-reduce.
>
> Before I implement this, I want to ask: Does this seem like the appropriate
> way to handle this type of operation? And, is there any better way to do this
> in the current version of Riak?
>
>
> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <[email protected]> wrote:
>
>> There are some inefficiencies in the python client... I've been profiling it
>> recently and found that it occasionally takes the python client longer when
>> you're on the same machine.
>>
>> Perhaps Sean could comment?
>>
>> Shuhao
>> Sent from my phone.
>>
>> On 2013-04-10 4:04 PM, "Jeff Peck" <[email protected]> wrote:
>> Thanks Evan. I tried doing it in Python like this (realizing that the
>> previous way I did it used MapReduce) and had better results. It finished
>> in 3.5 minutes, which is better but still nowhere close to the 15 seconds
>> of the straight HTTP query:
>>
>> import riak
>> from pprint import pprint
>>
>> bucket_name = "mybucket"
>>
>> # Protocol Buffers transport on the default PB port (8087)
>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>> bucket = client.bucket(bucket_name)
>> # Exact-match secondary index lookup; returns the matching keys
>> results = bucket.get_index('status_bin', 'PERSISTED')
>>
>> print len(results)
>>
>>
>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan <[email protected]>
>> wrote:
>>
>> > get_index() is the right function there, I think.
>> >
>> > On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <[email protected]> wrote:
>> >> I can grab over 900,000 keys from an index, using an HTTP query, in
>> >> about 15 seconds, whereas the same operation in Python times out after
>> >> 5 minutes. Does this indicate that I am using the Python API
>> >> incorrectly? Should I fall back to a raw HTTP request when I need to
>> >> grab this many keys?
>> >>
>> >> (Note: This is tied to the question I asked earlier, but it is also a
>> >> general question to help me understand the proper usage of the Python
>> >> API.)
>> >>
>> >> Thanks! Examples are below.
>> >>
>> >> - Jeff
>> >>
>> >> ---
>> >>
>> >> HTTP:
>> >>
>> >> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED \
>> >>     | grep -o , | wc -l
>> >> 926047
>> >>
>> >> real 0m14.583s
>> >> user 0m2.500s
>> >> sys 0m0.270s
>> >>
>> >> ---
>> >>
>> >> Python:
>> >>
>> >> import riak
>> >>
>> >> bucket = "mybucket"
>> >> # HTTP transport on the default REST port (8098)
>> >> client = riak.RiakClient(port=8098)
>> >> results = client.index(bucket, 'status_bin',
>> >> 'PERSISTED').run(timeout=5*60*1000)  # 5 minute timeout
>> >> print len(results)
>> >>
>> >> (times out after 5 minutes)
>>
>>
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com