Re: speeding up riaksearch precommit indexing

Sylvain Niles Tue, 21 Jun 2011 14:43:40 -0700

Why not write to a queue bucket with a timestamp and have a queue
processor move writes to the "final" bucket once they're over a
certain age? It can dedup/validate at that point too.
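
A minimal sketch of that idea, for illustration only. Plain Python dicts
stand in for the queue bucket and the "final" bucket, so no Riak client
calls are shown. The names drain_queue, min_age and is_valid, and the
entry layout (ts, id, body), are assumptions made up for this example.

```python
import time


def drain_queue(queue_bucket, final_bucket, min_age, now=None,
                is_valid=lambda entry: True):
    """Move entries older than min_age seconds into final_bucket.

    queue_bucket maps a queue key to {"ts": ..., "id": ..., "body": ...}.
    final_bucket maps the item id to its body.
    Returns the list of item ids that were written to final_bucket.
    """
    now = time.time() if now is None else now
    # Pick the entries that are old enough, oldest first.
    ready = sorted(
        (k for k, e in queue_bucket.items() if now - e["ts"] >= min_age),
        key=lambda k: queue_bucket[k]["ts"],
    )
    moved = []
    for qkey in ready:
        entry = queue_bucket.pop(qkey)      # always leaves the queue
        if not is_valid(entry):
            continue                        # validation: drop bad entries
        if entry["id"] in final_bucket:
            continue                        # dedup: id already stored
        final_bucket[entry["id"]] = entry["body"]
        moved.append(entry["id"])
    return moved


# Two redundant readers queued the same tweet; a third entry is too new.
queue = {
    "reader-a/1": {"ts": 100.0, "id": "t1", "body": "hello"},
    "reader-b/1": {"ts": 101.0, "id": "t1", "body": "hello"},
    "reader-a/2": {"ts": 158.0, "id": "t2", "body": "too new"},
}
final = {}
moved = drain_queue(queue, final, min_age=30, now=160.0)
```

Here the duplicate from reader-b is dropped at move time, and the young
entry stays queued until a later pass. Against a real cluster the dict
operations would become bucket reads, writes and deletes, and only the
"final" bucket would need the search precommit hook.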



On Tue, Jun 21, 2011 at 2:26 PM, Les Mikesell <[email protected]> wrote:
> Where can I find the redis hacks that get close to clustering?  Would
> membase work with synchronous replication on a pair of nodes for a reliable
> atomic 'check and set' operation to dedup redundant data before writing to
> riak?   Conceptually I like the 'smart client' fault tolerance of
> memcache/membase and restricting it to a pair of machines would keep the
> client configuration reasonable.
>
>  -Les
>
>
> On 6/18/2011 6:54 PM, John D. Rowell wrote:
>>
>> The "real" queues like HornetQ and others can take care of this without
>> a single point of failure but it's a pain (in my opinion) to set them up
>> that way, and usually with all the cluster and failover features active
>> they get quite slow for writes. We use Redis for this because it's
>> simpler and lightweight. The problem is that there is no real clustering
>> option for Redis today, even though there are some hacks that get
>> close. When we cannot afford a single point of failure or any downtime,
>> we tend to use MongoDB for simple queues. It has full cluster support
>> and the performance is pretty close to what you get with Redis in this
>> use case.
>>
>> OTOH you could keep it all Riak and set up a separate small cluster with
>> a RAM backend and use that as a queue, probably with similar
>> performance. The idea here is that you can scale these clusters (the
>> "queue" and the indexed production data) independently in response to
>> your load patterns, and have optimum hardware and I/O specs for the
>> different cluster nodes.
>>
>> -jd
>>
>> 2011/6/18 Les Mikesell <[email protected]>
>>
>>    Is there a good way to handle something like this with redundancy
>>    all the way through?  On simple key/value items you could have two
>>    readers write the same things to riak and let bitcask cleanup
>>    eventually discard one, but with indexing you probably need to use
>>    some sort of failover approach up front.  Do any of those queue
>>    managers handle that without adding their own single point of
>>    failure?  Assuming there are unique identifiers in the items being
>>    written, you might use the CAS feature of redis to arbitrate writes
>>    into its queue, but what happens when the redis node fails?
>>
>>      -Les
>>
>>
>>
>>    On 6/17/11 11:48 PM, John D. Rowell wrote:
>>
>>        Why not decouple the twitter stream processing from the indexing?
>>        More than likely you have a single process consuming the spritzer
>>        stream, so you can put the fetched results in a queue (hornetq,
>>        beanstalk, or even a simple Redis queue) and then have workers
>>        pull from the queue and insert into Riak. You could run one worker
>>        per node and thus insert in parallel into all nodes. If you need
>>        free CPU (e.g. for searches), just throttle the workers to some
>>        sane level. If you see the queue getting bigger, add another Riak
>>        node (and thus another local worker).
>>
>>        -jd
>>
>>        2011/6/13 Steve Webb <[email protected]>
>>
>>
>>            Ok, I've changed my two VMs to each have:
>>
>>            3 CPUs, 1GB ram, 120GB disk
>>
>>            I'm ingesting the twitter spritzer stream (about 10-20 tweets per
>>            second, approx 2k of data per tweet).  One bucket is storing the
>>            non-indexed tweets in full.  Another bucket is storing the indexed
>>            tweet string, id, date and username.  A maximum of 20 clients can
>>            be hitting the 'cluster' at any one time.
>>
>>            I'm using n_val=2 so there is replication going on behind the
>>            scenes.
>>
>>            I'm using a hardware load-balancer to distribute the work amongst
>>            the two nodes and now I'm seeing about 75% CPU usage as opposed to
>>            100% on one node and 50% on the replicating-only node.
>>
>>            I've monitored the VM over the last few days and it seems to be
>>            mostly CPU-bound.  The disk I/O is low.  The Network I/O is low.
>>
>>            Q: Can I change the pre-commit to a post-commit trigger or
>>            something perhaps or will that make any difference at all?  I'm ok
>>            if the tweet stuff doesn't get indexed immediately and there's a
>>            slight lag in indexing if it saves on CPU.
>>
>>
>>
>>
>>        _______________________________________________
>>        riak-users mailing list
>>        [email protected]
>>        http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>>
>
