Repository: accumulo-website

Updated Branches:
  refs/heads/asf-site 2c7f1e8cd -> 9ebc5f9a1
  refs/heads/master   29778dd0d -> 817a0ef72
Organized documentation * Moved iterator_testing.md content to development_tools.md * Moved proxy docs from client.md to new proxy.md * Renamed analytics.md to mapreduce.md and moved combiner docs to iterators.md * Reordered development docs Project: http://git-wip-us.apache.org/repos/asf/accumulo-website/repo Commit: http://git-wip-us.apache.org/repos/asf/accumulo-website/commit/817a0ef7 Tree: http://git-wip-us.apache.org/repos/asf/accumulo-website/tree/817a0ef7 Diff: http://git-wip-us.apache.org/repos/asf/accumulo-website/diff/817a0ef7 Branch: refs/heads/master Commit: 817a0ef7238c66bd48dcb184470998b3a1463b19 Parents: 29778dd Author: Mike Walch <[email protected]> Authored: Fri May 26 10:12:21 2017 -0400 Committer: Mike Walch <[email protected]> Committed: Fri May 26 10:12:21 2017 -0400 ---------------------------------------------------------------------- _docs-unreleased/development/analytics.md | 226 ---------- .../development/development_tools.md | 96 ++++- .../development/high_speed_ingest.md | 2 +- _docs-unreleased/development/iterator_design.md | 386 ----------------- .../development/iterator_testing.md | 97 ----- _docs-unreleased/development/iterators.md | 419 +++++++++++++++++++ _docs-unreleased/development/mapreduce.md | 181 ++++++++ _docs-unreleased/development/proxy.md | 121 ++++++ _docs-unreleased/development/sampling.md | 2 +- _docs-unreleased/development/security.md | 2 +- _docs-unreleased/development/summaries.md | 2 +- _docs-unreleased/getting-started/clients.md | 120 +----- 12 files changed, 828 insertions(+), 826 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/analytics.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/development/analytics.md b/_docs-unreleased/development/analytics.md deleted file mode 100644 index e579bf6..0000000 --- a/_docs-unreleased/development/analytics.md +++ /dev/null @@ -1,226 +0,0 @@ ---- -title: Analytics -category: development -order: 8 ---- - -Accumulo supports more advanced data processing than simply keeping keys -sorted and performing efficient lookups. Analytics can be developed by using -MapReduce and Iterators in conjunction with Accumulo tables. - -## MapReduce - -Accumulo tables can be used as the source and destination of MapReduce jobs. To -use an Accumulo table with a MapReduce job (specifically with the new Hadoop API -as of version 0.20), configure the job parameters to use the AccumuloInputFormat -and AccumuloOutputFormat. Accumulo specific parameters can be set via these -two format classes to do the following: - -* Authenticate and provide user credentials for the input -* Restrict the scan to a range of rows -* Restrict the input to a subset of available columns - -### Mapper and Reducer classes - -To read from an Accumulo table create a Mapper with the following class -parameterization and be sure to configure the AccumuloInputFormat. - -```java -class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> { - public void map(Key k, Value v, Context c) { - // transform key and value data here - } -} -``` - -To write to an Accumulo table, create a Reducer with the following class -parameterization and be sure to configure the AccumuloOutputFormat. The key -emitted from the Reducer identifies the table to which the mutation is sent. This -allows a single Reducer to write to more than one table if desired. 
A default table -can be configured using the AccumuloOutputFormat, in which case the output table -name does not have to be passed to the Context object within the Reducer. - -```java -class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> { - public void reduce(WritableComparable key, Iterable<Text> values, Context c) { - Mutation m; - // create the mutation based on input key and value - c.write(new Text("output-table"), m); - } -} -``` - -The Text object passed as the output should contain the name of the table to which -this mutation should be applied. The Text can be null in which case the mutation -will be applied to the default table name specified in the AccumuloOutputFormat -options. - -### AccumuloInputFormat options - -```java -Job job = new Job(getConf()); -AccumuloInputFormat.setInputInfo(job, - "user", - "passwd".getBytes(), - "table", - new Authorizations()); - -AccumuloInputFormat.setZooKeeperInstance(job, "myinstance", - "zooserver-one,zooserver-two"); -``` - -**Optional Settings:** - -To restrict Accumulo to a set of row ranges: - -```java -ArrayList<Range> ranges = new ArrayList<Range>(); -// populate array list of row ranges ... -AccumuloInputFormat.setRanges(job, ranges); -``` - -To restrict Accumulo to a list of columns: - -```java -ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>(); -// populate list of columns -AccumuloInputFormat.fetchColumns(job, columns); -``` - -To use a regular expression to match row IDs: - -```java -IteratorSetting is = new IteratorSetting(30, RexExFilter.class); -RegExFilter.setRegexs(is, ".*suffix", null, null, null, true); -AccumuloInputFormat.addIterator(job, is); -``` - -### AccumuloMultiTableInputFormat options - -The AccumuloMultiTableInputFormat allows the scanning over multiple tables -in a single MapReduce job. Separate ranges, columns, and iterators can be -used for each table. - -```java -InputTableConfig tableOneConfig = new InputTableConfig(); -InputTableConfig tableTwoConfig = new InputTableConfig(); -``` - -To set the configuration objects on the job: - -```java -Map<String, InputTableConfig> configs = new HashMap<String,InputTableConfig>(); -configs.put("table1", tableOneConfig); -configs.put("table2", tableTwoConfig); -AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs); -``` - -**Optional settings:** - -To restrict to a set of ranges: - -```java -ArrayList<Range> tableOneRanges = new ArrayList<Range>(); -ArrayList<Range> tableTwoRanges = new ArrayList<Range>(); -// populate array lists of row ranges for tables... -tableOneConfig.setRanges(tableOneRanges); -tableTwoConfig.setRanges(tableTwoRanges); -``` - -To restrict Accumulo to a list of columns: - -```java -ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>(); -ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>(); -// populate lists of columns for each of the tables ... -tableOneConfig.fetchColumns(tableOneColumns); -tableTwoConfig.fetchColumns(tableTwoColumns); -``` - -To set scan iterators: - -```java -List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>(); -List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>(); -// populate the lists of iterator settings for each of the tables ... 
-tableOneConfig.setIterators(tableOneIterators); -tableTwoConfig.setIterators(tableTwoIterators); -``` - -The name of the table can be retrieved from the input split: - -```java -class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> { - public void map(Key k, Value v, Context c) { - RangeInputSplit split = (RangeInputSplit)c.getInputSplit(); - String tableName = split.getTableName(); - // do something with table name - } -} -``` - -### AccumuloOutputFormat options - -```java -boolean createTables = true; -String defaultTable = "mytable"; - -AccumuloOutputFormat.setOutputInfo(job, - "user", - "passwd".getBytes(), - createTables, - defaultTable); - -AccumuloOutputFormat.setZooKeeperInstance(job, "myinstance", - "zooserver-one,zooserver-two"); -``` - -**Optional Settings:** - -```java -AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds -AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes -``` - -The [MapReduce example](https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md) -contains a complete example of using MapReduce with Accumulo. - -## Combiners - -Many applications can benefit from the ability to aggregate values across common -keys. This can be done via Combiner iterators and is similar to the Reduce step in -MapReduce. This provides the ability to define online, incrementally updated -analytics without the overhead or latency associated with batch-oriented -MapReduce jobs. - -All that is needed to aggregate values of a table is to identify the fields over which -values will be grouped, insert mutations with those fields as the key, and configure -the table with a combining iterator that supports the summarizing operation -desired. - -The only restriction on an combining iterator is that the combiner developer -should not assume that all values for a given key have been seen, since new -mutations can be inserted at anytime. This precludes using the total number of -values in the aggregation such as when calculating an average, for example. - -### Feature Vectors - -An interesting use of combining iterators within an Accumulo table is to store -feature vectors for use in machine learning algorithms. For example, many -algorithms such as k-means clustering, support vector machines, anomaly detection, -etc. use the concept of a feature vector and the calculation of distance metrics to -learn a particular model. The columns in an Accumulo table can be used to efficiently -store sparse features and their weights to be incrementally updated via the use of an -combining iterator. - -## Statistical Modeling - -Statistical models that need to be updated by many machines in parallel could be -similarly stored within an Accumulo table. For example, a MapReduce job that is -iteratively updating a global statistical model could have each map or reduce worker -reference the parts of the model to be read and updated through an embedded -Accumulo client. - -Using Accumulo this way enables efficient and fast lookups and updates of small -pieces of information in a random access pattern, which is complementary to -MapReduce's sequential access model. 
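The Combiners section above describes configuring "the table with a combining iterator" without showing the client call. A minimal sketch using the built-in `SummingCombiner` might look like the following; the table name, column family, iterator name, and priority are illustrative only, and the snippet assumes an Accumulo 1.x `Connector` obtained elsewhere.

```java
import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class AttachSummingCombiner {

  // Attach a SummingCombiner so values written under the same key are summed incrementally.
  static void attach(Connector conn) throws Exception {
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    // Values are encoded as decimal strings; "counts" is an example column family.
    LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    SummingCombiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("counts")));
    conn.tableOperations().attachIterator("feature_vectors", setting);
  }
}
```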
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/development_tools.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/development_tools.md b/_docs-unreleased/development/development_tools.md
index 3e326e2..f9768f6 100644
--- a/_docs-unreleased/development/development_tools.md
+++ b/_docs-unreleased/development/development_tools.md
@@ -1,7 +1,7 @@
 ---
 title: Development Tools
 category: development
-order: 3
+order: 4
 ---
 
 Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of
@@ -9,6 +9,100 @@ Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc.
 you want to write a unit test that uses Accumulo, you need a lot of infrastructure
 in place before your test can run.
 
+## Iterator Test Harness
+
+Iterators, while extremely powerful, are notoriously difficult to test. While the API defines
+the methods an Iterator must implement and each method's functionality, the actual invocation
+of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests.
+
+The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework
+for all Accumulo Iterators to leverage to identify common pitfalls in user-created Iterators.
+
+### Framework Use
+
+The harness provides an abstract class for use with JUnit4. Users must define the following for this
+abstract class:
+
+ * A `SortedMap` of input data (`Key`-`Value` pairs)
+ * A `Range` to use in tests
+ * A `Map` of options (`String` to `String` pairs)
+ * A `SortedMap` of output data (`Key`-`Value` pairs)
+ * A list of `IteratorTestCase`s (these can be automatically discovered)
+
+The majority of effort a user must make is in creating the input dataset and the expected
+output dataset for the iterator being tested.
+
+### Normal Test Outline
+
+Most iterator tests will follow the given outline:
+
+```java
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.iteratortest.IteratorTestCaseFinder;
+import org.apache.accumulo.iteratortest.IteratorTestInput;
+import org.apache.accumulo.iteratortest.IteratorTestOutput;
+import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest;
+import org.apache.accumulo.iteratortest.testcases.IteratorTestCase;
+import org.junit.runners.Parameterized.Parameters;
+
+public class MyIteratorTest extends BaseJUnit4IteratorTest {
+
+  @Parameters
+  public static Object[][] parameters() {
+    final IteratorTestInput input = createIteratorInput();
+    final IteratorTestOutput output = createIteratorOutput();
+    final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases();
+    return BaseJUnit4IteratorTest.createParameters(input, output, testCases);
+  }
+
+  private static SortedMap<Key,Value> INPUT_DATA = createInputData();
+  private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData();
+
+  private static SortedMap<Key,Value> createInputData() {
+    // TODO -- implement this method
+  }
+
+  private static SortedMap<Key,Value> createOutputData() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestInput createIteratorInput() {
+    final Map<String,String> options = createIteratorOptions();
+    final Range range = createRange();
+    return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA);
+  }
+
+  private static Map<String,String> createIteratorOptions() {
+    // TODO -- implement this method
+    // Tip: Use INPUT_DATA if helpful in generating output
+  }
+
+  private static Range createRange() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestOutput createIteratorOutput() {
+    return new IteratorTestOutput(OUTPUT_DATA);
+  }
+}
+```
+
+### Limitations
+
+While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators,
+there are still many limitations to the existing test harness. Some of them are:
+
+ * Can only specify a single iterator, not many (a "stack")
+ * No control over provided IteratorEnvironment for tests
+ * Exercising delete keys (especially with major compactions that do not include all files)
+
+These are left as future improvements to the harness.
+
 ## Mock Accumulo
 
 Mock Accumulo supplies mock implementations for much of the client API. It presently

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/high_speed_ingest.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/high_speed_ingest.md b/_docs-unreleased/development/high_speed_ingest.md
index 7d906a0..f52f501 100644
--- a/_docs-unreleased/development/high_speed_ingest.md
+++ b/_docs-unreleased/development/high_speed_ingest.md
@@ -1,7 +1,7 @@
 ---
 title: High-Speed Ingest
 category: development
-order: 7
+order: 8
 ---
 
 Accumulo is often used as part of a larger data processing and storage system.
To http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/iterator_design.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/development/iterator_design.md b/_docs-unreleased/development/iterator_design.md deleted file mode 100644 index cfb46c8..0000000 --- a/_docs-unreleased/development/iterator_design.md +++ /dev/null @@ -1,386 +0,0 @@ ---- -title: Iterator Design -category: development -order: 1 ---- - -Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs -that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly -brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java -Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element -in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to an specified point in the -iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a -series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create -multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can -merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to -a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators -are not designed to act as triggers nor are they designed to operate outside of the purview of a single table. - -Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is -buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong -understanding of this as the interface provides clear definitions about what each action each method should take. This -chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common -pitfalls. - -## Instantiation - -To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every -TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in -`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files -in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while -testing an Iterator. Advanced classloader features which enable other types of filesystems and per-table classpath -configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user -manual. - -Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that -Iterators must have a public no-args constructor. 
- -## Interface - -A normal implementation of the SortedKeyValueIterator defines functionality for the following methods: - -```java -void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException; - -boolean hasTop(); - -void next() throws IOException; - -void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException; - -Key getTopKey(); - -Value getTopValue(); - -SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env); -``` - -### init - -The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should -clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the -Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom"). -The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon. - -The second argument, a Map of options, is made up of options provided by the user, options set in the table's -configuration, and/or options set in the containing namespace's configuration. -These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context -(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter -Key-Value pairs in a regular expression Iterator. - -The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the -context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator -knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan -(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to -optimize itself. - -### seek - -The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to -advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before -the implementation of this method returns some additional processing is performed which may further advance the current -position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For -example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek` -returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value -pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` -and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is -a Key-Value pair to return. - -The arguments passed to seek are as follows: - -The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the -Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for -both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not -guaranteed to be a Range that the client set. 
Accumulo will split up larger ranges and group them together based on -Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) -provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range. - -The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or -excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families -should be treated as an inclusion collection (true) or an exclusion collection (false). - -It is likely that all implementations of `seek` will first make a call to the `seek` method on the -"source" Iterator that was provided in the `init` method. The collection of column families and -the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will -also implement some sort of additional logic to find or compute the first Key-Value pair in the provided -Range. For example, a regular expression Iterator would consume all records which do not match the given -pattern before returning from `seek`. - -It is important to retain the original Range passed to this method to know when this Iterator should stop -reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it -will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. -Best practice is to never emit entries outside the seek range. - -### next - -The `next` method is analogous to the `next` method on a Java Iterator: this method should advance -the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex -logic, this may result in more than one Key-Value pair being inspected. This method alters -some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods. - -The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` -can later return. While there is another Key-Value pair to return, `hasTop` should return true. -If there are no more Key-Value pairs to return from this Iterator since the last call to -`seek`, `hasTop` should return false. - -### hasTop - -The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs -the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method -should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not -alter the internal state of the Iterator. - -### getTopKey and getTopValue - -These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, -both of these methods should return non-null objects. If `hasTop` returns false, it is undefined -what these methods should return. Like `hasTop`, multiple calls to these methods should not alter -the state of the Iterator. - -Users should take caution when either - -1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator. -In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. -Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data -that the cached Key/Value object references and resulting in a logic bug. -2. modifying the Key/Value from `getTopKey`/`getTopValue`. 
If the source iterator reuses data stored in the Key/Value, -then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug. - -In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, -it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects -returned from `getTopKey`/`getTopValue`. - -### deepCopy - -The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. -Implementations of this method should return a new object of the same type as the Accumulo Iterator -instance it was called on. Any internal state from the instance `deepCopy` was called -on should be carried over to the returned copy. The returned copy should be ready to have -`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on -an iterator before `deepCopy` and that `init` will not be called on the iterator returned by -`deepCopy`. - -Typically, implementations of `deepCopy` call a copy-constructor which will initialize -internal data structures. As with `seek`, it is common for the `IteratorEnvironment` -argument to be ignored as most Iterator implementations can be written without the explicit -information the environment provides. - -In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as -early programming assignments which implement their own tree data structures. `deepCopy` calls -copy on its sources (the children), copies itself, attaches the copies of the children, and -then returns itself. - -## TabletServer invocation of Iterators - -The following code is a general outline for how TabletServers invoke Iterators. - -```java -List<KeyValue> batch; -Range range = getRangeFromClient(); -while(!overSizeLimit(batch)){ - SortedKeyValueIterator source = getSystemIterator(); - - for(String clzName : getUserIterators()){ - Class<?> clz = Class.forName(clzName); - SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance(); - iter.init(source, opts, env); - source = iter; - } - - // read a batch of data to return to client - // the last iterator, the "top" - SortedKeyValueIterator topIter = source; - topIter.seek(getRangeFromUser(), ...) - - while(topIter.hasTop() && !overSizeLimit(batch)){ - key = topIter.getTopKey() - val = topIter.getTopValue() - batch.add(new KeyValue(key, val) - if(systemDataSourcesChanged()){ - // code does not show isolation case, which will - // keep using same data sources until a row boundry is hit - range = new Range(key, false, range.endKey(), range.endKeyInclusive()); - break; - } - } -} -//return batch of key values to client -``` - -Additionally, the obtuse "re-seek" case can be outlined as the following: - -```java -// Given the above -List<KeyValue> batch = getNextBatch(); - -// Store off lastKeyReturned for this client -lastKeyReturned = batch.get(batch.size() - 1).getKey(); - -// thread goes away (client stops asking for the next batch). - -// Eventually client comes back -// Setup as before... - -Range userRange = getRangeFromUser(); -Range actualRange = new Range(lastKeyReturned, false - userRange.getEndKey(), userRange.isEndKeyInclusive()); - -// Use the actualRange, not the user provided one -topIter.seek(actualRange); -``` - -## Isolation - -Accumulo provides a feature which clients can enable to prevent the viewing of partially -applied mutations within the context of rows. 
If a client is submitting multiple column -updates to rows at a time, isolation would ensure that a client would either see all of -updates made to that row or none of the updates (until they are all applied). - -When using Isolation, there are additional concerns in iterator design. A scan time iterator in accumulo -reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a -key/value it is possible that accumulo may switch data sources and re-seek the iterator. This is done so that resources -may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables -Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the -next possible row. - -## Abstract Iterators - -A number of Abstract implementations of Iterators are provided to allow for faster creation -of common patterns. The most commonly used abstract implementations are the `Filter` and -`Combiner` classes. When possible these classes should be used instead as they have been -thoroughly tested inside Accumulo itself. - -### Filter - -The `Filter` abstract Iterator provides a very simple implementation which allows implementations -to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method. - -Filters are extremely simple to implement; however, when the implementation is filtering a -large percentage of Key-Value pairs with respect to the total number of pairs examined, -it can be very inefficient. For example, if a Filter implementation can determine after examining -part of the row that no other pairs in this row will be accepted, there is no mechanism to -efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of -1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined -that no other Key-Value pairs in this row will be accepted. The Filter must still examine each -remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that -Filters have no means to leverage the `seek` method to efficiently skip large portions -of Key-Value pairs. - -As such, the `Filter` class functions well for filtering small amounts of data, but is -inefficient for filtering large amounts of data. The decision to use a `Filter` strongly -depends on the use case and distribution of data being filtered. - -### Combiner - -The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface -define in Hadoop's MapReduce framework, implementations of this abstract class reduce -multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair. -Combiners provide a simple way to implement common operations like summation and -aggregation without the need to implement the entire Accumulo Iterator interface. - -One important consideration when choosing to design a Combiner is that the "reduction" operation -is often best represented when it is associative and commutative. Operations which do not meet -these criteria can be implemented; however, the implementation can be difficult. - -A second consideration is that a Combiner is not guaranteed to see every Key-Value pair -which differ only by timestamp every time it is invoked. 
For example, if there are 5 Key-Value -pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that -every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for -Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the -timestamps 1, 2, 4 and 5. - -Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the -Combiner at a priority less than the Combiner (the Versioning Iterator is added at a priority of 20 by default). The -Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value -pair that has the largest timestamp. - -## Best practices - -Because of the flexibility that the `SortedKeyValueInterface` provides, it doesn't directly disallow -many implementations which are poor design decisions. The following are some common recommendations to -follow and pitfalls to avoid in Iterator implementations. - -#### Avoid special logic encoded in Ranges - -Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified. -If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the -`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators -should *never* make assumptions about the current state/context based on the `Range`. - -The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the -"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value -pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of -Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available -resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know -the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created -from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan, -non-inclusive. - -### `seek`'ing backwards - -The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators. -By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can -greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed. - -While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator, -there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its -"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions -to prevent this from happening. - -Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops -inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should -construct a completely new Range as it allows for bugs to be introduced which will break Accumulo. - -Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. 
The -`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the -`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your -Iterator's implementation of seek. - -### Take caution in constructing new data in an Iterator - -Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means -to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator -is *not* managed in such a way that guarantees that this is safe nor efficient. Specifically, there -is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) -are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every -Key-Value pair is also prohibitively performance limiting to be considered an option. - -The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. -For example, the `WholeRowIterator` serializes the all of the Key-Values pairs that fall within each -row. A safe way to generate more data in an Iterator would be to construct an Iterator that is -"higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are -a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute -some function, and add a new Key-Value pair to the original collection, re-serializing the collection -of Key-Value pairs back into a single Key-Value pair. - -Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will -always see all intended data that is generated. - -## Final things to remember - -Some simple recommendations/points to keep in mind: - -### Method call order - -On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, -`getTopKey` and `getTopValue` will not be called if `hasTop` returns false. - -### Teardown - -As mentioned, instance of Iterators may be torn down inside of the server transparently. When a complex -collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value -pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned -to the caller). Being torn-down is equivalent to a new instance of the Iterator being creating and `deepCopy` -being called on the new instance with the old instance provided as the argument to `deepCopy`. References -to the old instance are removed and the object is lazily garbage collected by the JVM. - -## Compaction-time Iterators - -When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need -to make different assertions than those who only operate at scan time. Iterators won't see the delete entries; however, -Iterators will not necessarily see all of the Key-Value pairs in ever invocation. Because compactions often do not rewrite -all files (only a subset of them), it is possible that the logic take this into consideration. - -For example, a Combiner that runs over data at during compactions, might not see all of the values for a given Key. The -Combiner must recognize this and not perform any function that would be incorrect due -to the missing values. 
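To make the compaction-time caveat above concrete, here is a minimal sketch of a `Combiner` subclass whose reduction stays correct even when only a subset of a Key's versions is seen; the class name and string encoding are illustrative, and in practice the built-in `SummingCombiner` already covers this case.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

/**
 * Sums values encoded as decimal strings. Because the operation is associative
 * and commutative, and a partial sum is itself a valid value, the result stays
 * correct when a compaction sees only some versions of a Key.
 */
public class StringSumCombiner extends Combiner {

  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    long sum = 0;
    while (iter.hasNext()) {
      sum += Long.parseLong(new String(iter.next().get(), StandardCharsets.UTF_8));
    }
    return new Value(Long.toString(sum).getBytes(StandardCharsets.UTF_8));
  }
}
```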
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/iterator_testing.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/development/iterator_testing.md b/_docs-unreleased/development/iterator_testing.md deleted file mode 100644 index a0e82de..0000000 --- a/_docs-unreleased/development/iterator_testing.md +++ /dev/null @@ -1,97 +0,0 @@ ---- -title: Iterator Testing -category: development -order: 2 ---- - -Iterators, while extremely powerful, are notoriously difficult to test. While the API defines -the methods an Iterator must implement and each method's functionality, the actual invocation -of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests. - -The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework -for all Accumulo Iterators to leverage to identify common pitfalls in user-created Iterators. - -## Framework Use - -The harness provides an abstract class for use with JUnit4. Users must define the following for this -abstract class: - - * A `SortedMap` of input data (`Key`-`Value` pairs) - * A `Range` to use in tests - * A `Map` of options (`String` to `String` pairs) - * A `SortedMap` of output data (`Key`-`Value` pairs) - * A list of `IteratorTestCase`s (these can be automatically discovered) - -The majority of effort a user must make is in creating the input dataset and the expected -output dataset for the iterator being tested. - -## Normal Test Outline - -Most iterator tests will follow the given outline: - -```java -import java.util.List; -import java.util.SortedMap; - -import org.apache.accumulo.core.data.Key; -import org.apache.accumulo.core.data.Range; -import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.iteratortest.IteratorTestCaseFinder; -import org.apache.accumulo.iteratortest.IteratorTestInput; -import org.apache.accumulo.iteratortest.IteratorTestOutput; -import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest; -import org.apache.accumulo.iteratortest.testcases.IteratorTestCase; -import org.junit.runners.Parameterized.Parameters; - -public class MyIteratorTest extends BaseJUnit4IteratorTest { - - @Parameters - public static Object[][] parameters() { - final IteratorTestInput input = createIteratorInput(); - final IteratorTestOutput output = createIteratorOutput(); - final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases(); - return BaseJUnit4IteratorTest.createParameters(input, output, tests); - } - - private static SortedMap<Key,Value> INPUT_DATA = createInputData(); - private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData(); - - private static SortedMap<Key,Value> createInputData() { - // TODO -- implement this method - } - - private static SortedMap<Key,Value> createOutputData() { - // TODO -- implement this method - } - - private static IteratorTestInput createIteratorInput() { - final Map<String,String> options = createIteratorOptions(); - final Range range = createRange(); - return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA); - } - - private static Map<String,String> createIteratorOptions() { - // TODO -- implement this method - // Tip: Use INPUT_DATA if helpful in generating output - } - - private static Range createRange() { - // TODO -- implement this method - } - - private static IteratorTestOutput createIteratorOutput() { - return new IteratorTestOutput(OUTPUT_DATA); - } -} -``` - -## 
Limitations
-
-While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators,
-there are still many limitations to the existing test harness. Some of them are:
-
- * Can only specify a single iterator, not many (a "stack")
- * No control over provided IteratorEnvironment for tests
- * Exercising delete keys (especially with major compactions that do not include all files)
-
-These are left as future improvements to the harness.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/iterators.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/iterators.md b/_docs-unreleased/development/iterators.md
new file mode 100644
index 0000000..947d5e0
--- /dev/null
+++ b/_docs-unreleased/development/iterators.md
@@ -0,0 +1,419 @@
+---
+title: Iterators
+category: development
+order: 1
+---
+
+Accumulo SortedKeyValueIterators, commonly referred to as **Iterators** for short, are server-side programming constructs
+that allow users to implement custom retrieval or computational logic within Accumulo TabletServers. The name rightly
+brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java
+Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element
+in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to a specified point in the
+iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a
+series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create
+multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can
+merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is closer to
+a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators
+are not designed to act as triggers nor are they designed to operate outside of the purview of a single table.
+
+Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is
+buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong
+understanding of this as the interface provides clear definitions about what action each method should take. This
+chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common
+pitfalls.
+
+## Instantiation
+
+To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every
+TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in
+`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files
+in this directory are dynamically reloaded by the TabletServers, alleviating the need to restart Accumulo while
+testing an Iterator. There are also advanced classloader features which enable other types of filesystems and
+per-table classpath configurations (as opposed to process-wide classpaths). These features are not covered here,
+but are documented elsewhere in the user manual.
+
+Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator.
This means that +Iterators must have a public no-args constructor. + +## Interface + +A normal implementation of the SortedKeyValueIterator defines functionality for the following methods: + +```java +void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException; + +boolean hasTop(); + +void next() throws IOException; + +void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException; + +Key getTopKey(); + +Value getTopValue(); + +SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env); +``` + +### init + +The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should +clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the +Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom"). +The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon. + +The second argument, a Map of options, is made up of options provided by the user, options set in the table's +configuration, and/or options set in the containing namespace's configuration. +These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context +(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter +Key-Value pairs in a regular expression Iterator. + +The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the +context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator +knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan +(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to +optimize itself. + +### seek + +The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to +advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before +the implementation of this method returns some additional processing is performed which may further advance the current +position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For +example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek` +returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value +pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` +and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is +a Key-Value pair to return. + +The arguments passed to seek are as follows: + +The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the +Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for +both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not +guaranteed to be a Range that the client set. 
Accumulo will split up larger ranges and group them together based on +Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) +provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range. + +The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or +excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families +should be treated as an inclusion collection (true) or an exclusion collection (false). + +It is likely that all implementations of `seek` will first make a call to the `seek` method on the +"source" Iterator that was provided in the `init` method. The collection of column families and +the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will +also implement some sort of additional logic to find or compute the first Key-Value pair in the provided +Range. For example, a regular expression Iterator would consume all records which do not match the given +pattern before returning from `seek`. + +It is important to retain the original Range passed to this method to know when this Iterator should stop +reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it +will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. +Best practice is to never emit entries outside the seek range. + +### next + +The `next` method is analogous to the `next` method on a Java Iterator: this method should advance +the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex +logic, this may result in more than one Key-Value pair being inspected. This method alters +some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods. + +The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` +can later return. While there is another Key-Value pair to return, `hasTop` should return true. +If there are no more Key-Value pairs to return from this Iterator since the last call to +`seek`, `hasTop` should return false. + +### hasTop + +The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs +the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method +should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not +alter the internal state of the Iterator. + +### getTopKey and getTopValue + +These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, +both of these methods should return non-null objects. If `hasTop` returns false, it is undefined +what these methods should return. Like `hasTop`, multiple calls to these methods should not alter +the state of the Iterator. + +Users should take caution when either + +1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator. +In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. +Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data +that the cached Key/Value object references and resulting in a logic bug. +2. modifying the Key/Value from `getTopKey`/`getTopValue`. 
If the source iterator reuses data stored in the Key/Value, +then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug. + +In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, +it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects +returned from `getTopKey`/`getTopValue`. + +### deepCopy + +The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. +Implementations of this method should return a new object of the same type as the Accumulo Iterator +instance it was called on. Any internal state from the instance `deepCopy` was called +on should be carried over to the returned copy. The returned copy should be ready to have +`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on +an iterator before `deepCopy` and that `init` will not be called on the iterator returned by +`deepCopy`. + +Typically, implementations of `deepCopy` call a copy-constructor which will initialize +internal data structures. As with `seek`, it is common for the `IteratorEnvironment` +argument to be ignored as most Iterator implementations can be written without the explicit +information the environment provides. + +In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as +early programming assignments which implement their own tree data structures. `deepCopy` calls +copy on its sources (the children), copies itself, attaches the copies of the children, and +then returns itself. + +## TabletServer invocation of Iterators + +The following code is a general outline for how TabletServers invoke Iterators. + +```java +List<KeyValue> batch; +Range range = getRangeFromClient(); +while(!overSizeLimit(batch)){ + SortedKeyValueIterator source = getSystemIterator(); + + for(String clzName : getUserIterators()){ + Class<?> clz = Class.forName(clzName); + SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance(); + iter.init(source, opts, env); + source = iter; + } + + // read a batch of data to return to client + // the last iterator, the "top" + SortedKeyValueIterator topIter = source; + topIter.seek(getRangeFromUser(), ...) + + while(topIter.hasTop() && !overSizeLimit(batch)){ + key = topIter.getTopKey() + val = topIter.getTopValue() + batch.add(new KeyValue(key, val) + if(systemDataSourcesChanged()){ + // code does not show isolation case, which will + // keep using same data sources until a row boundry is hit + range = new Range(key, false, range.endKey(), range.endKeyInclusive()); + break; + } + } +} +//return batch of key values to client +``` + +Additionally, the obtuse "re-seek" case can be outlined as the following: + +```java +// Given the above +List<KeyValue> batch = getNextBatch(); + +// Store off lastKeyReturned for this client +lastKeyReturned = batch.get(batch.size() - 1).getKey(); + +// thread goes away (client stops asking for the next batch). + +// Eventually client comes back +// Setup as before... + +Range userRange = getRangeFromUser(); +Range actualRange = new Range(lastKeyReturned, false + userRange.getEndKey(), userRange.isEndKeyInclusive()); + +// Use the actualRange, not the user provided one +topIter.seek(actualRange); +``` + +## Isolation + +Accumulo provides a feature which clients can enable to prevent the viewing of partially +applied mutations within the context of rows. 
If a client is submitting multiple column +updates to rows at a time, isolation would ensure that a client would either see all of +updates made to that row or none of the updates (until they are all applied). + +When using Isolation, there are additional concerns in iterator design. A scan time iterator in accumulo +reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a +key/value it is possible that accumulo may switch data sources and re-seek the iterator. This is done so that resources +may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables +Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the +next possible row. + +## Abstract Iterators + +A number of Abstract implementations of Iterators are provided to allow for faster creation +of common patterns. The most commonly used abstract implementations are the `Filter` and +`Combiner` classes. When possible these classes should be used instead as they have been +thoroughly tested inside Accumulo itself. + +### Filter + +The `Filter` abstract Iterator provides a very simple implementation which allows implementations +to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method. + +Filters are extremely simple to implement; however, when the implementation is filtering a +large percentage of Key-Value pairs with respect to the total number of pairs examined, +it can be very inefficient. For example, if a Filter implementation can determine after examining +part of the row that no other pairs in this row will be accepted, there is no mechanism to +efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of +1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined +that no other Key-Value pairs in this row will be accepted. The Filter must still examine each +remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that +Filters have no means to leverage the `seek` method to efficiently skip large portions +of Key-Value pairs. + +As such, the `Filter` class functions well for filtering small amounts of data, but is +inefficient for filtering large amounts of data. The decision to use a `Filter` strongly +depends on the use case and distribution of data being filtered. + +### Combiner + +The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface +define in Hadoop's MapReduce framework, implementations of this abstract class reduce +multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair. +Combiners provide a simple way to implement common operations like summation and +aggregation without the need to implement the entire Accumulo Iterator interface. + +One important consideration when choosing to design a Combiner is that the "reduction" operation +is often best represented when it is associative and commutative. Operations which do not meet +these criteria can be implemented; however, the implementation can be difficult. + +A second consideration is that a Combiner is not guaranteed to see every Key-Value pair +which differ only by timestamp every time it is invoked. 
+
+### Combiner
+
+The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
+defined in Hadoop's MapReduce framework, implementations of this abstract class reduce
+multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair.
+Combiners provide a simple way to implement common operations like summation and
+aggregation without the need to implement the entire Accumulo Iterator interface.
+
+One important consideration when choosing to design a Combiner is that the "reduction" operation
+is often best represented when it is associative and commutative. Operations which do not meet
+these criteria can be implemented; however, the implementation can be difficult.
+
+A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
+which differs only by timestamp every time it is invoked. For example, if there are 5 Key-Value
+pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that
+every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for
+Keys with timestamps 1 and 4, while another invocation might see the Values for Keys with the
+timestamps 1, 2, 4 and 5.
+
+Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
+Combiner at a priority less than the Versioning Iterator (which is added at a priority of 20 by default). The
+Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
+pair that has the largest timestamp.
+
+#### Combiner Applications
+
+Many applications can benefit from the ability to aggregate values across common
+keys. This can be done via Combiner iterators and is similar to the Reduce step in
+MapReduce. This provides the ability to define online, incrementally updated
+analytics without the overhead or latency associated with batch-oriented
+MapReduce jobs.
+
+All that is needed to aggregate values of a table is to identify the fields over which
+values will be grouped, insert mutations with those fields as the key, and configure
+the table with a combining iterator that supports the summarizing operation
+desired.
+
+The only restriction on a combining iterator is that the combiner developer
+should not assume that all values for a given key have been seen, since new
+mutations can be inserted at any time. This precludes using the total number of
+values in the aggregation, such as when calculating an average, for example.
+
+An interesting use of combining iterators within an Accumulo table is to store
+feature vectors for use in machine learning algorithms. For example, many
+algorithms such as k-means clustering, support vector machines, anomaly detection,
+etc. use the concept of a feature vector and the calculation of distance metrics to
+learn a particular model. The columns in an Accumulo table can be used to efficiently
+store sparse features and their weights to be incrementally updated via the use of a
+combining iterator.
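+
+For instance, a table of counters can be summed with the built-in `SummingCombiner`. The snippet
+below is an illustration only (the `connector` variable, the "counts_table" table name, and the
+"counts" column family are assumed); note the priority of 15, which is below the Versioning
+Iterator's default of 20:
+
+```java
+IteratorSetting setting = new IteratorSetting(15, "sum", SummingCombiner.class);
+// values are longs encoded as strings
+SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
+// only combine values in the "counts" column family
+SummingCombiner.setColumns(setting, Collections.singletonList(new IteratorSetting.Column("counts")));
+connector.tableOperations().attachIterator("counts_table", setting);
+```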
+
+## Best practices
+
+Because of the flexibility that the `SortedKeyValueIterator` interface provides, it does not directly disallow
+many implementations which are poor design decisions. The following are some common recommendations to
+follow and pitfalls to avoid in Iterator implementations.
+
+### Avoid special logic encoded in Ranges
+
+Commonly, the granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` arrive unmodified.
+If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
+`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
+should *never* make assumptions about the current state/context based on the `Range`.
+
+The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
+"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
+pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of
+Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available
+resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know
+the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created
+from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
+non-inclusive.
+
+### `seek`'ing backwards
+
+The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators.
+`seek`'ing when it is known that a collection of Key-Value pairs can be ignored can
+greatly increase the speed of a scan, as many Key-Value pairs do not have to be deserialized and processed.
+
+While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator,
+there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its
+"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions
+to prevent this from happening.
+
+Since Iterators are allowed to `seek` to arbitrary Keys, it is also possible for Iterators to create infinite loops
+inside Scans that will repeatedly read the same data without end. If an arbitrary Range must be constructed, it should
+be built as a completely new Range, as careless construction makes it easy to introduce bugs which will break Accumulo.
+
+Thus, a `seek` should always be thought of as making "forward progress" in the view of the total iteration. The
+`startKey` of a `Range` should always be greater than the current Key seen by the Iterator, while the `endKey` of the
+`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your
+Iterator's implementation of `seek`.
+
+### Take caution in constructing new data in an Iterator
+
+Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means
+to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator
+is *not* managed in such a way that guarantees that this is safe or efficient. Specifically, there
+is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s)
+are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every
+Key-Value pair is also far too performance-limiting to be considered an option.
+
+The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair.
+For example, the `WholeRowIterator` serializes all of the Key-Value pairs that fall within each
+row. A safe way to generate more data in an Iterator would be to construct an Iterator that is
+"higher" (at a larger priority) than the `WholeRowIterator`; that is, the Iterator receives Key-Value pairs which are
+each a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute
+some function, and add a new Key-Value pair to the original collection, re-serializing the collection
+of Key-Value pairs back into a single Key-Value pair.
+
+In any other approach, there is no guarantee that the caller (a Scan or a Compaction) will
+always see all of the additional data that was intended to be generated.
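+
+The sketch below illustrates that pattern; it is not from the original text, the class name is
+hypothetical, and a production implementation would also override `init` and `deepCopy`. It assumes
+the iterator is configured at a larger priority than the `WholeRowIterator`, so each incoming Value
+is an encoded row:
+
+```java
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.SortedMap;
+
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.WrappingIterator;
+import org.apache.accumulo.core.iterators.user.WholeRowIterator;
+import org.apache.hadoop.io.Text;
+
+public class RowAugmentingIterator extends WrappingIterator {
+
+  @Override
+  public Value getTopValue() {
+    Key rowKey = super.getTopKey();
+    try {
+      // decode the serialized row produced by the WholeRowIterator below us
+      SortedMap<Key,Value> row = WholeRowIterator.decodeRow(rowKey, super.getTopValue());
+
+      // compute some function and add a new Key-Value pair to the collection;
+      // here, simply a count of the pairs seen in the row
+      Key derived = new Key(rowKey.getRow(), new Text("derived"), new Text("pairCount"));
+      row.put(derived, new Value(Integer.toString(row.size()).getBytes()));
+
+      // re-serialize the augmented collection back into a single Key-Value pair
+      return WholeRowIterator.encodeRow(new ArrayList<>(row.keySet()), new ArrayList<>(row.values()));
+    } catch (IOException e) {
+      throw new RuntimeException(e);
+    }
+  }
+}
+```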
+
+## Final things to remember
+
+Some simple recommendations/points to keep in mind:
+
+### Method call order
+
+On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`,
+and `getTopKey` and `getTopValue` will not be called if `hasTop` returns false.
+
+### Teardown
+
+As mentioned, instances of Iterators may be torn down inside of the server transparently. When a complex
+collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value
+pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned
+to the caller). Being torn down is equivalent to a new instance of the Iterator being created and `deepCopy`
+being called on the new instance with the old instance provided as the argument to `deepCopy`. References
+to the old instance are removed and the object is lazily garbage collected by the JVM.
+
+## Compaction-time Iterators
+
+When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need
+to make different assertions than those that only operate at scan time. Iterators won't see the delete entries; however,
+Iterators will not necessarily see all of the Key-Value pairs in every invocation. Because compactions often do not rewrite
+all files (only a subset of them), iterator logic must take this into consideration.
+
+For example, a Combiner that runs over data during compactions might not see all of the values for a given Key. The
+Combiner must recognize this and not perform any function that would be incorrect due
+to the missing values.
+
+## Testing
+
+The [Iterator test harness][iterator-test-harness] is a generalized testing framework for Accumulo Iterators that can
+identify common pitfalls in user-created Iterators.
+
+[iterator-test-harness]: {{ page.docs_baseurl }}/development/development_tools#iterator-test-harness
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/mapreduce.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/mapreduce.md b/_docs-unreleased/development/mapreduce.md
new file mode 100644
index 0000000..98b2682
--- /dev/null
+++ b/_docs-unreleased/development/mapreduce.md
@@ -0,0 +1,181 @@
+---
+title: MapReduce
+category: development
+order: 2
+---
+
+Accumulo tables can be used as the source and destination of MapReduce jobs. To
+use an Accumulo table with a MapReduce job (specifically with the new Hadoop API
+as of version 0.20), configure the job parameters to use the AccumuloInputFormat
+and AccumuloOutputFormat. Accumulo-specific parameters can be set via these
+two format classes to do the following:
+
+* Authenticate and provide user credentials for the input
+* Restrict the scan to a range of rows
+* Restrict the input to a subset of available columns
+
+## Mapper and Reducer classes
+
+To read from an Accumulo table, create a Mapper with the following class
+parameterization and be sure to configure the AccumuloInputFormat.
+
+```java
+class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+  public void map(Key k, Value v, Context c) {
+    // transform key and value data here
+  }
+}
+```
+
+To write to an Accumulo table, create a Reducer with the following class
+parameterization and be sure to configure the AccumuloOutputFormat. The key
+emitted from the Reducer identifies the table to which the mutation is sent.
+This allows a single Reducer to write to more than one table if desired. A default table
+can be configured using the AccumuloOutputFormat, in which case the output table
+name does not have to be passed to the Context object within the Reducer.
+
+```java
+class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
+  public void reduce(WritableComparable key, Iterable<Writable> values, Context c)
+      throws IOException, InterruptedException {
+    Mutation m = new Mutation(key.toString());
+    // populate the mutation based on the input key and values
+    c.write(new Text("output-table"), m);
+  }
+}
+```
+
+The Text object passed as the output should contain the name of the table to which
+this mutation should be applied. The Text can be null, in which case the mutation
+will be applied to the default table name specified in the AccumuloOutputFormat
+options.
+
+## AccumuloInputFormat options
+
+```java
+Job job = new Job(getConf());
+AccumuloInputFormat.setInputInfo(job,
+    "user",
+    "passwd".getBytes(),
+    "table",
+    new Authorizations());
+
+AccumuloInputFormat.setZooKeeperInstance(job, "myinstance",
+    "zooserver-one,zooserver-two");
+```
+
+**Optional Settings:**
+
+To restrict Accumulo to a set of row ranges:
+
+```java
+ArrayList<Range> ranges = new ArrayList<Range>();
+// populate array list of row ranges ...
+AccumuloInputFormat.setRanges(job, ranges);
+```
+
+To restrict Accumulo to a list of columns:
+
+```java
+ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
+// populate list of columns
+AccumuloInputFormat.fetchColumns(job, columns);
+```
+
+To use a regular expression to match row IDs:
+
+```java
+IteratorSetting is = new IteratorSetting(30, RegExFilter.class);
+RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
+AccumuloInputFormat.addIterator(job, is);
+```
+
+## AccumuloMultiTableInputFormat options
+
+The AccumuloMultiTableInputFormat allows scanning over multiple tables
+in a single MapReduce job. Separate ranges, columns, and iterators can be
+used for each table.
+
+```java
+InputTableConfig tableOneConfig = new InputTableConfig();
+InputTableConfig tableTwoConfig = new InputTableConfig();
+```
+
+To set the configuration objects on the job:
+
+```java
+Map<String, InputTableConfig> configs = new HashMap<String,InputTableConfig>();
+configs.put("table1", tableOneConfig);
+configs.put("table2", tableTwoConfig);
+AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);
+```
+
+**Optional settings:**
+
+To restrict to a set of ranges:
+
+```java
+ArrayList<Range> tableOneRanges = new ArrayList<Range>();
+ArrayList<Range> tableTwoRanges = new ArrayList<Range>();
+// populate array lists of row ranges for tables...
+tableOneConfig.setRanges(tableOneRanges);
+tableTwoConfig.setRanges(tableTwoRanges);
+```
+
+To restrict Accumulo to a list of columns:
+
+```java
+ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>();
+ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>();
+// populate lists of columns for each of the tables ...
+tableOneConfig.fetchColumns(tableOneColumns);
+tableTwoConfig.fetchColumns(tableTwoColumns);
+```
+
+To set scan iterators:
+
+```java
+List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>();
+List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>();
+// populate the lists of iterator settings for each of the tables ...
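+// for example (hypothetical illustration; reusing the RegExFilter class shown earlier):
+// tableOneIterators.add(new IteratorSetting(50, "rowFilter", RegExFilter.class));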
+tableOneConfig.setIterators(tableOneIterators);
+tableTwoConfig.setIterators(tableTwoIterators);
+```
+
+The name of the table can be retrieved from the input split:
+
+```java
+class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+  public void map(Key k, Value v, Context c) {
+    RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
+    String tableName = split.getTableName();
+    // do something with table name
+  }
+}
+```
+
+## AccumuloOutputFormat options
+
+```java
+boolean createTables = true;
+String defaultTable = "mytable";
+
+AccumuloOutputFormat.setOutputInfo(job,
+    "user",
+    "passwd".getBytes(),
+    createTables,
+    defaultTable);
+
+AccumuloOutputFormat.setZooKeeperInstance(job, "myinstance",
+    "zooserver-one,zooserver-two");
+```
+
+**Optional Settings:**
+
+```java
+AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds
+AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes
+```
+
+The [MapReduce example][mapred-example] contains a complete example of using MapReduce with Accumulo.
+
+[mapred-example]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/proxy.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/proxy.md b/_docs-unreleased/development/proxy.md
new file mode 100644
index 0000000..6e9f7eb
--- /dev/null
+++ b/_docs-unreleased/development/proxy.md
@@ -0,0 +1,121 @@
+---
+title: Proxy
+category: development
+order: 3
+---
+
+## Proxy
+
+The proxy API allows interaction with Accumulo from languages other than Java.
+A proxy server is provided in the codebase, and a client can then be generated for
+the language of your choice.
+The proxy API can also be used instead of the traditional ZooKeeperInstance class to
+provide a single TCP port so that clients can be securely routed through a firewall,
+without requiring access to all tablet servers in the cluster.
+
+### Prerequisites
+
+The proxy server can live on any node on which the basic client API would work. That
+means it must be able to communicate with the Master, ZooKeepers, NameNode, and the
+DataNodes. A proxy client only needs the ability to communicate with the proxy server.
+
+### Configuration
+
+The configuration options for the proxy server live inside of a properties file. At
+the very least, you need to supply the following properties:
+
+    protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory
+    tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken
+    port=42424
+    instance=test
+    zookeepers=localhost:2181
+
+You can find a sample configuration file in your distribution at `proxy/proxy.properties`.
+
+This sample configuration file also demonstrates the ability to back the proxy server
+with MockAccumulo or the MiniAccumuloCluster.
+
+### Running the Proxy Server
+
+After the properties file holding the configuration is created, the proxy server
+can be started using the following command in the Accumulo distribution (assuming
+your properties file is named `config.properties`):
+
+    accumulo proxy -p config.properties
+
+### Creating a Proxy Client
+
+Aside from installing the Thrift compiler, you will also need the language-specific library
+for Thrift installed to generate client code in that language. Typically, your operating
+system's package manager will be able to automatically install these for you in an expected
+location such as `/usr/lib/python/site-packages/thrift`.
+
+You can find the thrift file for generating the client at `proxy/proxy.thrift`.
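+
+For example, assuming the Thrift compiler and the Python Thrift library are installed, a Python
+client could be generated with a command along the lines of the following (the generated code is
+written to a `gen-py` directory):
+
+    thrift --gen py proxy/proxy.thrift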
+
+After a client is generated, the port specified in the configuration properties above will be
+used to connect to the server.
+
+### Using a Proxy Client
+
+The following examples have been written in Java and the method signatures may be
+slightly different depending on the language specified when generating the client with
+the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's
+documentation for examples of connecting to a Thrift service), the methods on the
+proxy client will be available. The first thing to do is log in:
+
+```java
+Map<String,String> password = new HashMap<String,String>();
+password.put("password", "secret");
+ByteBuffer token = client.login("root", password);
+```
+
+Once logged in, the token returned will be used for most subsequent calls to the client.
+Let's create a table, add some data, and scan the table.
+
+First, create a table.
+
+```java
+client.createTable(token, "myTable", true, TimeType.MILLIS);
+```
+
+Next, add some data:
+
+```java
+// first, create a writer on the server
+String writer = client.createWriter(token, "myTable", new WriterOptions());
+
+// row id
+ByteBuffer rowid = ByteBuffer.wrap("UUID".getBytes());
+
+// mutation-like class
+ColumnUpdate cu = new ColumnUpdate();
+cu.setColFamily("MyFamily".getBytes());
+cu.setColQualifier("MyQualifier".getBytes());
+cu.setColVisibility("VisLabel".getBytes());
+cu.setValue("Some Value.".getBytes());
+
+List<ColumnUpdate> updates = new ArrayList<ColumnUpdate>();
+updates.add(cu);
+
+// build column updates
+Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = new HashMap<ByteBuffer, List<ColumnUpdate>>();
+cellsToUpdate.put(rowid, updates);
+
+// send updates to the server
+client.updateAndFlush(writer, "myTable", cellsToUpdate);
+
+client.closeWriter(writer);
+```
+
+Scan for the data and batch the return of the results on the server:
+
+```java
+String scanner = client.createScanner(token, "myTable", new ScanOptions());
+ScanResult results = client.nextK(scanner, 100);
+
+for(KeyValue keyValue : results.getResults()) {
+  // do something with results
+}
+
+client.closeScanner(scanner);
+```
+
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/sampling.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/sampling.md b/_docs-unreleased/development/sampling.md
index 4a76c39..b1c54ef 100644
--- a/_docs-unreleased/development/sampling.md
+++ b/_docs-unreleased/development/sampling.md
@@ -1,7 +1,7 @@
 ---
 title: Sampling
 category: development
-order: 4
+order: 5
 ---
 
 ## Overview
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/security.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/security.md b/_docs-unreleased/development/security.md
index ea1f997..0671d50 100644
--- a/_docs-unreleased/development/security.md
+++ b/_docs-unreleased/development/security.md
@@ -1,7 +1,7 @@
 ---
 title: Security
 category: development
-order: 6
+order: 7
 ---
 
 Accumulo extends the BigTable data model to implement a security mechanism
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/development/summaries.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/development/summaries.md b/_docs-unreleased/development/summaries.md index a86e30d..1e8a8b4 100644 --- a/_docs-unreleased/development/summaries.md +++ b/_docs-unreleased/development/summaries.md @@ -1,7 +1,7 @@ --- title: Summary Statistics category: development -order: 5 +order: 6 --- ## Overview http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/817a0ef7/_docs-unreleased/getting-started/clients.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/getting-started/clients.md b/_docs-unreleased/getting-started/clients.md index 88d4a13..5dc52d3 100644 --- a/_docs-unreleased/getting-started/clients.md +++ b/_docs-unreleased/getting-started/clients.md @@ -265,120 +265,13 @@ You may consider using the [WholeRowIterator] with the BatchScanner to achieve isolation. The drawback of this approach is that entire rows are read into memory on the server side. If a row is too big, it may crash a tablet server. -## Proxy +## Additional Documentation -The proxy API allows the interaction with Accumulo with languages other than Java. -A proxy server is provided in the codebase and a client can further be generated. -The proxy API can also be used instead of the traditional ZooKeeperInstance class to -provide a single TCP port in which clients can be securely routed through a firewall, -without requiring access to all tablet servers in the cluster. +This page covers Accumulo client basics. Below are links to additional documentation that may be useful when creating Accumulo clients: -### Prerequisites - -The proxy server can live on any node in which the basic client API would work. That -means it must be able to communicate with the Master, ZooKeepers, NameNode, and the -DataNodes. A proxy client only needs the ability to communicate with the proxy server. - -### Configuration - -The configuration options for the proxy server live inside of a properties file. At -the very least, you need to supply the following properties: - - protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory - tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken - port=42424 - instance=test - zookeepers=localhost:2181 - -You can find a sample configuration file in your distribution at `proxy/proxy.properties`. - -This sample configuration file further demonstrates an ability to back the proxy server -by MockAccumulo or the MiniAccumuloCluster. - -### Running the Proxy Server - -After the properties file holding the configuration is created, the proxy server -can be started using the following command in the Accumulo distribution (assuming -your properties file is named `config.properties`): - - accumulo proxy -p config.properties - -### Creating a Proxy Client - -Aside from installing the Thrift compiler, you will also need the language-specific library -for Thrift installed to generate client code in that language. Typically, your operating -system's package manager will be able to automatically install these for you in an expected -location such as `/usr/lib/python/site-packages/thrift`. - -You can find the thrift file for generating the client at `proxy/proxy.thrift`. - -After a client is generated, the port specified in the configuration properties above will be -used to connect to the server. 
- -### Using a Proxy Client - -The following examples have been written in Java and the method signatures may be -slightly different depending on the language specified when generating client with -the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's -documentation for examples of connecting to a Thrift service), the methods on the -proxy client will be available. The first thing to do is log in: - -```java -Map password = new HashMap<String,String>(); -password.put("password", "secret"); -ByteBuffer token = client.login("root", password); -``` - -Once logged in, the token returned will be used for most subsequent calls to the client. -Let's create a table, add some data, scan the table, and delete it. - -First, create a table. - -```java -client.createTable(token, "myTable", true, TimeType.MILLIS); -``` - -Next, add some data: - -```java -// first, create a writer on the server -String writer = client.createWriter(token, "myTable", new WriterOptions()); - -//rowid -ByteBuffer rowid = ByteBuffer.wrap("UUID".getBytes()); - -//mutation like class -ColumnUpdate cu = new ColumnUpdate(); -cu.setColFamily("MyFamily".getBytes()); -cu.setColQualifier("MyQualifier".getBytes()); -cu.setColVisibility("VisLabel".getBytes()); -cu.setValue("Some Value.".getBytes()); - -List<ColumnUpdate> updates = new ArrayList<ColumnUpdate>(); -updates.add(cu); - -// build column updates -Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = new HashMap<ByteBuffer, List<ColumnUpdate>>(); -cellsToUpdate.put(rowid, updates); - -// send updates to the server -client.updateAndFlush(writer, "myTable", cellsToUpdate); - -client.closeWriter(writer); -``` - -Scan for the data and batch the return of the results on the server: - -```java -String scanner = client.createScanner(token, "myTable", new ScanOptions()); -ScanResult results = client.nextK(scanner, 100); - -for(KeyValue keyValue : results.getResultsIterator()) { - // do something with results -} - -client.closeScanner(scanner); -``` +* [Iterators] - Server-side programming mechanism that can modify key/value pairs at various points in data management process +* [Proxy] - Documentation for interacting with Accumulo using non-Java languages through a proxy server +* [MapReduce] - Documentation for reading and writing to Accumulo using MapReduce. [PasswordToken]: {{ page.javadoc_core }}/org/apache/accumulo/core/client/security/tokens/PasswordToken.html [AuthenticationToken]: {{ page.javadoc_core }}/org/apache/accumulo/core/client/security/tokens/AuthenticationToken.html @@ -392,3 +285,6 @@ client.closeScanner(scanner); [BatchScanner]: {{ page.javadoc_core}}/org/apache/accumulo/core/client/BatchScanner.html [Range]: {{ page.javadoc_core }}/org/apache/accumulo/core/data/Range.html [WholeRowIterator]: {{ page.javadoc_core }}/org/apache/accumulo/core/iterators/user/WholeRowIterator.html +[Iterators]: {{ page.docs_baseurl }}/development/iterators +[Proxy]: {{ page.docs_baseurl }}/development/proxy +[MapReduce]: {{ page.docs_baseurl }}/development/mapreduce
