Re: allow_mult vs. 2i

Jeremiah Peschka Wed, 25 Sep 2013 15:33:12 -0700

inline.

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Wed, Sep 25, 2013 at 2:47 PM, Brady Wetherington <[email protected]> wrote:

> I've built a solid proof-of-concept system on leveldb, and use some 2i
> indexes in order to search for certain things - usually just for counts of
> things.
>
> I have two questions so far:
>
> First off, why is Bitcask the default? Is it just because it is faster? Or
> is it considered more 'stable' or something?
>

Long ago, when Bitcask was chosen as the default, LevelDB was not a thing.

Databases strive for stability and the principle of least surprise.
Changing anything can introduce performance regressions, stability
problems, and a host of other undesirable, reputation-destroying things.

Changing the storage back end is high up on the list of things I'd never
want to do in a database. Why do you think MySQL kept MyISAM as its default
for so long? (InnoDB only became the default in 5.5.)

>
> Next, I've learned about the allow_mult feature you can set on buckets. I
> wonder if I should use this for my most heavily-used primary-purpose
> queries? Is there a limit to how many 'siblings' you can have for an entry?
> Is it inadvisable to do what I'm talking about? Would fetching all of the
> siblings end up being a disastrous nightmare or something?
>

The upper limit will depend on the size of your objects. You don't want
object sizes (including siblings) to go much beyond 6MB; past that point
you'll see a lot of network congestion. You certainly *could* have bigger
object + sibling collections, but you'd want to beef up the network backend
to something like 10GbE, 40GbE, or InfiniBand to deal with the increased
traffic between nodes.
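
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The 6MB ceiling is the rule of thumb above; the 200-byte sibling
size and the function names are made up for illustration, so plug in your
own figures.

```python
# Rough size budget for one object plus all of its siblings.
# 6 MB is the rule of thumb from the text; it is not a hard Riak limit.
SIZE_BUDGET_BYTES = 6 * 1024 * 1024


def max_siblings(avg_sibling_bytes: int, budget: int = SIZE_BUDGET_BYTES) -> int:
    """How many siblings of a given average size fit inside the budget."""
    if avg_sibling_bytes <= 0:
        raise ValueError("avg_sibling_bytes must be positive")
    return budget // avg_sibling_bytes


def total_bytes(sibling_count: int, avg_sibling_bytes: int) -> int:
    """Approximate payload moved on every read of the key."""
    return sibling_count * avg_sibling_bytes


if __name__ == "__main__":
    # Hypothetical 200-byte siblings.
    print(max_siblings(200))            # 31457
    print(total_bytes(1_000_000, 200))  # 200000000 bytes, far over budget
```

Even with tiny values, a sibling count in the millions lands well outside
the range where a single key stays cheap to read.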

Fetching all of your siblings only becomes a problem if you never resolve
them: unresolved siblings keep accumulating, and every read has to pull all
of that data back.

Allow_mult is typically turned on for production clusters. It is off by
default to help new users get a handle on Riak quickly without having to
worry about siblings. Once you get the hang of how Riak behaves, turning on
siblings is usually a good thing.

Depending on how you resolve conflicts, it's probably best to read your
data, resolve siblings, and send that garbage-collected object back to
Riak, even if you're performing a "read only" query. The new Riak DT
features eliminate some of the worry about siblings by pushing the
responsibility back down to Riak. Those features are only available if
you're building from source, but hopefully Riak 2.0 will be out soon.
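
A minimal sketch of that read, resolve, write-back loop, in plain Python.
The dict-of-lists store is a stand-in for a bucket with allow_mult on; it
is not the Riak client API, and the function names and the "tags" field
are hypothetical. The resolver shown (set union) is just one possible
policy; yours depends on your data.

```python
from typing import Callable, Dict, List

# In-memory stand-in for a bucket: each key maps to its list of siblings.
Store = Dict[str, List[dict]]


def union_tags(siblings: List[dict]) -> dict:
    """Example resolver: merge the 'tags' lists of all siblings."""
    merged = set()
    for sib in siblings:
        merged.update(sib.get("tags", []))
    return {"tags": sorted(merged)}


def read_and_repair(store: Store, key: str,
                    resolve: Callable[[List[dict]], dict]) -> dict:
    """Read a key; if it has siblings, resolve them and write the result back."""
    siblings = store[key]
    if len(siblings) == 1:
        return siblings[0]
    resolved = resolve(siblings)
    store[key] = [resolved]  # the write-back collapses the siblings
    return resolved


if __name__ == "__main__":
    store = {"user:42": [{"tags": ["a", "b"]}, {"tags": ["b", "c"]}]}
    print(read_and_repair(store, "user:42", union_tags))  # {'tags': ['a', 'b', 'c']}
    print(len(store["user:42"]))                          # 1
```

The point of writing back on a "read only" path is that the next reader
gets one value instead of repeating the same merge.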

> I *assume* - and I could be wrong - that a 2i query would be slower than a
> fetch-of-siblings for a particular key - is that wrong?
>
> If I switch from using 2i indexes to using allow_mult and siblings, we'd
> be talking a few hundred thousand to low millions for a sibling-count.
>

I do not think 'siblings' means what you think it means.

A sibling would occur if two clients, A and B, read v1 of an object and
then issue writes.

Client A updates object and sets preferences to ['cat pictures', 'ham
sandwiches']
Client B updates object and sets preferences to ['knitting with bacon']

With allow_mult enabled you'd have two versions of the object. These are
siblings.
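
Here is the same scenario as a toy Python model. Real Riak tracks causality
with vector clocks; this sketch substitutes a single integer version to
keep it short, and the TinyKey class is entirely hypothetical.

```python
class TinyKey:
    """Toy model of one key. An integer version stands in for a vector clock."""

    def __init__(self, value, allow_mult):
        self.allow_mult = allow_mult
        self.version = 1
        self.siblings = [value]

    def read(self):
        """Return the version the reader saw plus the current value(s)."""
        return self.version, list(self.siblings)

    def write(self, value, based_on):
        if based_on == self.version:
            # The writer saw the latest state, so its value supersedes it.
            self.siblings = [value]
        elif self.allow_mult:
            # The writer worked from a stale read: keep both values.
            self.siblings.append(value)
        else:
            # allow_mult off: the last write silently wins.
            self.siblings = [value]
        self.version += 1


def run(allow_mult):
    key = TinyKey({"preferences": []}, allow_mult)
    seen_by_a, _ = key.read()  # client A reads v1
    seen_by_b, _ = key.read()  # client B reads v1
    key.write({"preferences": ["cat pictures", "ham sandwiches"]}, seen_by_a)
    key.write({"preferences": ["knitting with bacon"]}, seen_by_b)
    return key.siblings


if __name__ == "__main__":
    print(len(run(allow_mult=True)))   # 2: both versions are kept
    print(len(run(allow_mult=False)))  # 1: client A's update is lost
```

So siblings are conflicting versions of one value, not a collection you
append members to.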

If you're thinking of some kind of index created by your application, you
could look at 2i vs using siblings to build a secondary index:

http://basho.com/index-for-fun-and-for-profit/

Even when you're creating your own secondary index, you still want to
perform garbage collection on the data you're storing in Riak.
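
As a rough idea of what that garbage collection can look like for an
application-built index, here is a Python sketch. It is my own
illustration, not the scheme from the linked post: the index value is
assumed to be a map of member key to a present/removed flag, removals win
on merge, and all names are hypothetical.

```python
from typing import Dict, Iterable

# One index value: member key -> True (present) or False (tombstone).
IndexValue = Dict[str, bool]


def merge_index(siblings: Iterable[IndexValue]) -> IndexValue:
    """Union sibling index values; a removal (False) wins over an add."""
    merged: IndexValue = {}
    for sib in siblings:
        for member, present in sib.items():
            merged[member] = merged.get(member, True) and present
    return merged


def compact(index: IndexValue) -> IndexValue:
    """Garbage-collect tombstones so the stored value stops growing."""
    return {member: True for member, present in index.items() if present}


if __name__ == "__main__":
    sibling_a = {"order:1": True, "order:2": True}
    sibling_b = {"order:2": False, "order:3": True}  # order:2 was removed
    merged = merge_index([sibling_a, sibling_b])
    print(compact(merged))  # {'order:1': True, 'order:3': True}
```

One caveat with this approach: if you drop tombstones too early, a stale
sibling that still carries the old entry can bring it back, so compact
only once you're reasonably sure no such writes are still in flight.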

> Thanks for making an excellent product! Can't wait to get this bad boy
> into production and really see what it can do!
>
> -B.
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>

