Github user keith-turner commented on a diff in the pull request: https://github.com/apache/incubator-fluo-recipes/pull/130#discussion_r114650888 --- Diff: docs/combine-queue.md --- @@ -0,0 +1,224 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> +# Combine Queue Recipe + +## Background + +When many transactions try to modify the same keys, collisions will occur. Too many collisions +cause transactions to fail and throughput to nose dive. For example, consider [phrasecount] +which has many transactions processing documents. Each transaction counts the phrases in a document +and then updates global phrase counts. Since transaction attempts to update many phrases +, the probability of collisions is high. + +## Solution + +The [combine queue recipe][CombineQueue] provides a reusable solution for updating many keys while +avoiding collisions. The recipe also organizes updates into batches in order to improve throughput. + +This recipes queues updates to keys for other transactions to process. In the phrase count example +transactions processing documents queue updates, but do not actually update the counts. Below is an +example of computing phrasecounts using this recipe. + + * TX1 queues `+1` update for phrase `we want lambdas now` + * TX2 queues `+1` update for phrase `we want lambdas now` + * TX3 reads the updates and current value for the phrase `we want lambdas now`. There is no current value and the updates sum to 2, so a new value of 2 is written. + * TX4 queues `+2` update for phrase `we want lambdas now` + * TX5 queues `-1` update for phrase `we want lambdas now` + * TX6 reads the updates and current value for the phrase `we want lambdas now`. The current value is 2 and the updates sum to 1, so a new value of 3 is written. + +Transactions processing updates have the ability to make additional updates. +For example in addition to updating the current value for a phrase, the new +value could also be placed on an export queue to update an external database. + +### Buckets + +A simple implementation of this recipe would have an update queue for each key. However the +implementation is slightly more complex. Each update queue is in a bucket and transactions process +all of the updates in a bucket. This allows more efficient processing of updates for the following +reasons : + + * When updates are queued, notifications are made per bucket(instead of per a key). + * The transaction doing the update can scan the entire bucket reading updates, this avoids a seek for each key being updated. + * Also the transaction can request a batch lookup to get the current value of all the keys being updated. + * Any additional actions taken on update (like adding something to an export queue) can also be batched. + * Data is organized to make reading exiting values for keys in a bucket more efficient. + +Which bucket a key goes to is decided using hash and modulus so that multiple updates for a key go +to the same bucket. + +The initial number of tablets to create when applying table optimizations can be controlled by +setting the buckets per tablet option when configuring a Combine Queue. For example if you +have 20 tablet servers and 1000 buckets and want 2 tablets per tserver initially then set buckets +per tablet to 1000/(2*20)=25. + +## Example Use + +The following code snippets show how to use this recipe for wordcount. The first step is to +configure it before initializing Fluo. When initializing an ID is needed. This ID is used in two +ways. First, the ID is used as a row prefix in the table. Therefore nothing else should use that +row range in the table. Second, the ID is used in generating configuration keys. + +The following snippet shows how to configure a combine queue. + +```java + FluoConfiguration fluoConfig = ...; + + // Set application properties for the combine queue. These properties are read later by + // the observers running on each worker. + CombineQueue.configure(WcObserverProvider.ID) + .keyType(String.class).valueType(Long.class).buckets(119).finish(fluoConfig); --- End diff -- I like `store()` also thinking of `set()` ```java CombineQueue.configure(WcObserverProvider.ID) .keyType(String.class).valueType(Long.class).buckets(119).store(fluoConfig); ``` ```java CombineQueue.configure(WcObserverProvider.ID) .keyType(String.class).valueType(Long.class).buckets(119).set(fluoConfig); ```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---