--
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop
On Sun, Jan 26, 2014 at 10:27 PM, fxmy wang <[email protected]> wrote:

> Thanks for the response, Jeremiah.
>
> > > Then here are my questions:
> > >
> > > 1) To get better write throughput, is it right to set w=1?
> >
> > This will improve perceived throughput at the client, but it won't
> > improve throughput at the server.
>
> Thank you for clarifying this for me :D
>
> > > 2) What's the best way to query these comments? In this use case, I
> > > don't need to retrieve all the comments in one bucket, just the latest
> > > few hundred comments (if there are that many), based on the time they
> > > were posted.
> > >
> > > Right now I'm thinking of using link-walking and keeping track of the
> > > latest comment so I can trace backwards to get the latest 500 comments
> > > (for example). When a new comment comes in, point its link to the old
> > > latest comment, then update the latest-comment marker.
> >
> > I wouldn't use link-walking. IIRC this uses MapReduce under the covers.
> > You could use a single key to store the most recent comment.
>
> What's bad about MapReduce? Since there will be another cache layer on
> top of the cluster, reads are relatively infrequent. That's why I chose
> link-walking.

Even when you run a MapReduce query over a single bucket, MapReduce has to
contact a majority of nodes in the cluster to perform a coverage query. In
effect, you're scanning all of the keys to make sure you find only the keys
in a single bucket. MapReduce can work for limited scenarios (e.g. mutating
the state of a large number of objects, or running batched analytics that
write to a separate set of buckets/keys), but people have reported
unsatisfactory results when trying to use MapReduce for live querying.

This sort of thing may be possible with the Riak Search 2.0 functionality
as well. I haven't played around with it enough to know whether it would be
a good fit or not.

> > You can get the most recent n keys using secondary index queries on the
> > $bucket index, sorting, and pagination.
>
> I'm not sure what you mean here =.=
> How can I query the most recent n keys using 2i? Should I put a timestamp
> -- say, by the hour -- in 2i on incoming comments, and then query 2i by
> hour segment at read time? That seems a little blind, because some videos
> could go a long time before being commented on again. Querying based on
> time segmentation feels like shooting in the dark to me :\

"Keys will consist of a timestamp and userID." Sounds like you could sort
on that to me. The $bucket index is a special index that contains nothing
but a list of the keys in a bucket. Querying $bucket is cheaper than a list
keys operation. There are a number of ways you can solve this problem that
are all implementation dependent.

> And the docs say the list keys operation should not be used in
> production, so that's a no-go either :\

A list keys operation is not the same as a $bucket index query. See
"Retrieve all Bucket Keys via $bucket Index" at
http://docs.basho.com/riak/latest/dev/using/2i/
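For a rough idea of what that looks like from the Erlang PB client
(untested, written from memory -- check riakc's docs for the exact option
names; the bucket name here is made up):

    %% $bucket is the special 2i index that lists every key in a bucket.
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
    Bucket = <<"comments_video12345">>.
    {ok, Results} =
        riakc_pb_socket:get_index_eq(Pid, Bucket, <<"$bucket">>, Bucket,
                                     [{max_results, 500},
                                      {pagination_sort, true}]).
    %% Results carries the matching keys plus a continuation you can pass
    %% back as {continuation, C} to fetch the next page.

Since your keys start with a timestamp, the paginated results come back in
key order; if you want the newest page first, one common trick is to store
an inverted timestamp (e.g. MaxInt - Timestamp) in the key so the newest
comments sort to the front.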
> > > So in the scenario above, is it possible that after one client has
> > > written on nodeA and modified the latest-comment marker, another
> > > client on nodeB doesn't yet see the change and points its link to the
> > > old comment, resulting in a "branch" in the line? If this could
> > > happen, what can be done to avoid it? Are there any better ways to
> > > store & query those comments? Any reply is appreciated.
> >
> > You can avoid siblings by serializing all of your writes through a
> > single writer. That's not a great idea since you lose many of Riak's
> > benefits.
> >
> > You could also use a CRDT with a register type. These tend toward the
> > last writer.
>
> My goal is to form a kind of single-line relationship through the keys,
> based on timestamp, under high concurrent write pressure. Through this
> relationship I can easily pick out the last hundreds or thousands of
> comments. As Jeremiah said, serializing all writes through a single
> writer avoids siblings entirely. And note that we don't have a
> key-clashing problem here -- every comment has a unique key. What we
> want is the single-line relationship. So how about this:
>
> Multiple Erlang PB clients just do the writes and don't worry about
> lining anything up. Post-commit hooks notify one special globally
> registered process (which should be running in the Riak cluster?) that
> "here comes a new comment, line it up when appropriate".
>
> Is this feasible? And if it is, how should I prepare for the cluster
> partition & rejoin scenario when the network fails?

It sounds to me like you're doing an awful lot of work to do something
that a relational database handles remarkably well.

> > The point is that you need to decide how you want to deal with this
> > type of scenario - it's going to happen. In a worst case, you lose a
> > write briefly.
>
> Hopefully the method above can avoid this :)
>
> Please, everyone, share your thoughts. _(:3JZ)_
>
> B.R.
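That said, if you do experiment with the post-commit route, a hook is just
an Erlang function deployed on the Riak nodes that gets handed the object
that was written. A bare-bones sketch (the serializer process and message
shape are made up for illustration):

    -module(comment_hook).
    -export([notify/1]).

    %% Post-commit hooks run on the coordinating node after a successful
    %% write; the return value is ignored. Wire this up via the bucket's
    %% postcommit property.
    notify(Object) ->
        Bucket = riak_object:bucket(Object),
        Key    = riak_object:key(Object),
        %% comment_serializer is a hypothetical globally registered
        %% process that appends new comments to the "line" in timestamp
        %% order.
        case global:whereis_name(comment_serializer) of
            undefined -> ok;
            Pid       -> Pid ! {new_comment, Bucket, Key}
        end,
        ok.

Keep in mind the notification is best-effort: during a partition each side
of the cluster can end up with its own serializer (or none at all), so you
still have to reconcile the line after the ring heals -- which is more or
less the sibling problem again, just moved into your own process.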
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
