Repository: accumulo-website
Updated Branches:
  refs/heads/asf-site 9ebc5f9a1 -> 3f99b6cc9
  refs/heads/master 817a0ef72 -> e0da132ec

Documentation updates

* Added javadocs links to iterators.md
* Fixed headers on proxy.md

Project: http://git-wip-us.apache.org/repos/asf/accumulo-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo-website/commit/e0da132e
Tree: http://git-wip-us.apache.org/repos/asf/accumulo-website/tree/e0da132e
Diff: http://git-wip-us.apache.org/repos/asf/accumulo-website/diff/e0da132e

Branch: refs/heads/master
Commit: e0da132ec4ace14e3745019f189f0e4b9454927b
Parents: 817a0ef
Author: Mike Walch <[email protected]>
Authored: Fri May 26 10:55:06 2017 -0400
Committer: Mike Walch <[email protected]>
Committed: Fri May 26 10:55:06 2017 -0400

----------------------------------------------------------------------
 _docs-unreleased/development/iterators.md | 73 ++++++++++++++------------
 _docs-unreleased/development/proxy.md     | 12 ++---
 2 files changed, 44 insertions(+), 41 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/e0da132e/_docs-unreleased/development/iterators.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/iterators.md b/_docs-unreleased/development/iterators.md
index 947d5e0..2e1b242 100644
--- a/_docs-unreleased/development/iterators.md
+++ b/_docs-unreleased/development/iterators.md
@@ -4,7 +4,7 @@ category: development
 order: 1
 ---
 
-Accumulo SortedKeyValueIterators, commonly referred to as **Iterators** for short, are server-side programming constructs
+Accumulo [SortedKeyValueIterators][SortedKeyValueIterator], commonly referred to as **Iterators** for short, are server-side programming constructs
 that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly
 brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java
 Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element
@@ -16,7 +16,7 @@ merge multiple Iterators into a single view. In this sense, a collection of Iter
 a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators
 are not designed to act as triggers nor are they designed to operate outside of the purview of a single table.
 
-Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is
+Understanding how TabletServers invoke the methods on a [SortedKeyValueIterator] can be obtuse as the actual code is
 buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong
 understanding of this as the interface provides clear definitions about what each action each method should take. This
 chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common
@@ -37,7 +37,7 @@ Iterators must have a public no-args constructor.
 
 ## Interface
 
-A normal implementation of the SortedKeyValueIterator defines functionality for the following methods:
+A normal implementation of the [SortedKeyValueIterator] defines functionality for the following methods:
 
 ```java
 void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException;
@@ -68,7 +68,7 @@ These options allow for Iterators to dynamically configure themselves on the fly
 (a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used
 to filter Key-Value pairs in a regular expression Iterator.
 
-The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the
+The third argument, the [IteratorEnvironment], is a special object which provides information to this Iterator about the
 context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator knows
 that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan (which
 may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to
@@ -79,7 +79,7 @@ optimize itself.
 The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to
 advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before
 the implementation of this method returns some additional processing is performed which may further advance the current
-position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For
+position past the `startKey` of the [Range]. This, however, is dependent on the functionality the iterator provides. For
 example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek`
 returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value
 pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey`
@@ -88,8 +88,8 @@ a Key-Value pair to return.
 
 The arguments passed to seek are as follows:
 
-The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the
-Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for
+The TabletServer first provides a [Range], an object which defines some collection of Accumulo `Key`s, which defines the
+Key-Value pairs that this Iterator should return. Each [Range] has a `startKey` and `endKey` with an inclusive flag for
 both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not
 guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on
 Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s)
@@ -101,12 +101,12 @@ should be treated as an inclusion collection (true) or an exclusion collection (
 
 It is likely that all implementations of `seek` will first make a call to the `seek` method on the "source" Iterator
 that was provided in the `init` method. The collection of column families and
-the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will
+the boolean `include` argument should be passed down as well as the [Range]. Somewhat commonly, the Iterator will
 also implement some sort of additional logic to find or compute the first Key-Value pair in the provided Range. For
 example, a regular expression Iterator would consume all records which do not match the given pattern before returning
 from `seek`.
 
-It is important to retain the original Range passed to this method to know when this Iterator should stop
+It is important to retain the original [Range] passed to this method to know when this Iterator should stop
 reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it will result in
 duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. Best practice is to never emit
 entries outside the seek range.
@@ -156,12 +156,12 @@ The `deepCopy` method is similar to the `clone` method from the Java `Cloneable`
 Implementations of this method should return a new object of the same type as the Accumulo Iterator
 instance it was called on. Any internal state from the instance `deepCopy` was called on should be
 carried over to the returned copy. The returned copy should be ready to have
-`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on
+`seek` called on it. The [SortedKeyValueIterator] interface guarantees that `init` will be called on
 an iterator before `deepCopy` and that `init` will not be called on the iterator returned by
 `deepCopy`.
 
 Typically, implementations of `deepCopy` call a copy-constructor which will initialize
-internal data structures. As with `seek`, it is common for the `IteratorEnvironment`
+internal data structures. As with `seek`, it is common for the [IteratorEnvironment]
 argument to be ignored as most Iterator implementations can be written without the explicit
 information the environment provides.
 
@@ -246,18 +246,18 @@ next possible row.
 ## Abstract Iterators
 
 A number of Abstract implementations of Iterators are provided to allow for faster creation
-of common patterns. The most commonly used abstract implementations are the `Filter` and
-`Combiner` classes. When possible these classes should be used instead as they have been
+of common patterns. The most commonly used abstract implementations are the [Filter] and
+[Combiner] classes. When possible these classes should be used instead as they have been
 thoroughly tested inside Accumulo itself.
 
 ### Filter
 
-The `Filter` abstract Iterator provides a very simple implementation which allows implementations
+The [Filter] abstract Iterator provides a very simple implementation which allows implementations
 to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.
 
 Filters are extremely simple to implement; however, when the implementation is filtering a large
 percentage of Key-Value pairs with respect to the total number of pairs examined,
-it can be very inefficient. For example, if a Filter implementation can determine after examining
+it can be very inefficient. For example, if a [Filter] implementation can determine after examining
 part of the row that no other pairs in this row will be accepted, there is no mechanism to
 efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of 1000
 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined
@@ -266,30 +266,30 @@ remaining 990 Key-Value pairs in this row. Another way to express this deficienc
 Filters have no means to leverage the `seek` method to efficiently skip large portions
 of Key-Value pairs.
 
-As such, the `Filter` class functions well for filtering small amounts of data, but is
-inefficient for filtering large amounts of data. The decision to use a `Filter` strongly
+As such, the [Filter] class functions well for filtering small amounts of data, but is
+inefficient for filtering large amounts of data. The decision to use a Filter strongly
 depends on the use case and distribution of data being filtered.
 
 ### Combiner
 
-The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
+The [Combiner] class is another common abstract Iterator. Similar to the `Combiner` interface
 define in Hadoop's MapReduce framework, implementations of this abstract class reduce multiple
 Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value
 pair. Combiners provide a simple way to implement common operations like summation and aggregation
 without the need to implement the entire Accumulo Iterator interface.
 
-One important consideration when choosing to design a Combiner is that the "reduction" operation
+One important consideration when choosing to design a [Combiner] is that the "reduction" operation
 is often best represented when it is associative and commutative. Operations which do not meet
 these criteria can be implemented; however, the implementation can be difficult.
 
-A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
+A second consideration is that a [Combiner] is not guaranteed to see every Key-Value pair
 which differ only by timestamp every time it is invoked. For example, if there are 5 Key-Value pairs
 in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that every
 invocation of the Combiner will see 5 timestamps. One invocation might see the Values for Keys with
 timestamp 1 and 4, while another invocation might see the Values for Keys with the timestamps 1, 2,
 4 and 5.
 
-Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
+Finally, when configuring an Accumulo table to use a [Combiner], be sure to disable the Versioning Iterator or set the
 Combiner at a priority less than the Combiner (the Versioning Iterator is added at a priority of 20 by default). The
 Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
 pair that has the largest timestamp.
@@ -297,7 +297,7 @@ pair that has the largest timestamp.
 #### Combiner Applications
 
 Many applications can benefit from the ability to aggregate values across common
-keys. This can be done via Combiner iterators and is similar to the Reduce step in
+keys. This can be done via [Combiner] iterators and is similar to the Reduce step in
 MapReduce. This provides the ability to define online, incrementally updated
 analytics without the overhead or latency associated with batch-oriented
 MapReduce jobs.
@@ -322,16 +322,16 @@ combining iterator.
 
 ## Best practices
 
-Because of the flexibility that the `SortedKeyValueInterface` provides, it doesn't directly disallow
+Because of the flexibility that the [SortedKeyValueInterface] provides, it doesn't directly disallow
 many implementations which are poor design decisions. The following are some common recommendations to follow and
 pitfalls to avoid in Iterator implementations.
 
 #### Avoid special logic encoded in Ranges
 
 Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified.
-If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
-`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
-should *never* make assumptions about the current state/context based on the `Range`.
+If a [Range] falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
+`seek` method. However, there is no guarantee that the [Range] will remain unaltered from client to server. As such, Iterators
+should *never* make assumptions about the current state/context based on the Range.
 
 The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
 "stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
@@ -342,24 +342,24 @@ the point to resume the iteration (to avoid returning duplicate Key-Value pairs)
 from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
 non-inclusive.
 
-### `seek`'ing backwards
+### seeking backwards
 
 The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators.
 By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can greatly increase
 the speed of a scan as many Key-Value pairs do not have to be deserialized and processed.
 
-While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator,
-there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its
-"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions
+While the `seek` method provides the [Range] that should be used to `seek` the underlying source Iterator,
+there is no guarantee that the implementing Iterator uses that Range to perform the `seek` on its
+"source" Iterator. As such, it is possible to seek to any Range and the interface has no assertions
 to prevent this from happening.
 
 Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops
-inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should
+inside Scans that will repeatedly read the same data without end. If an arbitrary [Range] is constructed, it should
 construct a completely new Range as it allows for bugs to be introduced which will break Accumulo.
 
 Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The
-`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the
-`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your
+`startKey` of a [Range] should always be greater than the current Key seen by the Iterator while the `endKey` of the
+Range should always retain the original `endKey` (and `endKey` inclusivity) of the last Range seen by your
 Iterator's implementation of seek.
 
 ### Take caution in constructing new data in an Iterator
@@ -407,7 +407,7 @@ to make different assertions than those who only operate at scan time. Iterators
 Iterators will not necessarily see all of the Key-Value pairs in ever invocation. Because compactions often do not rewrite
 all files (only a subset of them), it is possible that the logic take this into consideration.
 
-For example, a Combiner that runs over data at during compactions, might not see all of the values for a given Key. The
+For example, a [Combiner] that runs over data at during compactions, might not see all of the values for a given Key. The
 Combiner must recognize this and not perform any function that would be incorrect due
 to the missing values.
 
@@ -416,4 +416,9 @@ to the missing values.
 The [Iterator test harness][iterator-test-harness] is generalized testing framework for Accumulo Iterators
 that can identify common pitfalls in user-created Iterators.
 
+[SortedKeyValueIterator]: {{ page.javadoc_core }}/org/apache/accumulo/core/iterators/SortedKeyValueIterator.html
+[IteratorEnvironment]: {{ page.javadoc_core }}/org/apache/accumulo/core/iterators/IteratorEnvironment.html
+[Filter]: {{ page.javadoc_core }}/org/apache/accumulo/core/iterators/Filter.html
+[Combiner]: {{ page.javadoc_core }}/org/apache/accumulo/core/iterators/Combiner.html
+[Range]: {{ page.javadoc_core }}/org/apache/accumulo/core/data/Range.html
 [iterator-test-harness]: {{ page.docs_baseurl }}/development/development_tools#iterator-test-harness

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/e0da132e/_docs-unreleased/development/proxy.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/proxy.md b/_docs-unreleased/development/proxy.md
index 6e9f7eb..f3e8f3a 100644
--- a/_docs-unreleased/development/proxy.md
+++ b/_docs-unreleased/development/proxy.md
@@ -4,21 +4,19 @@ category: development
 order: 3
 ---
 
-## Proxy
-
 The proxy API allows the interaction
 with Accumulo with languages other than Java. A proxy server is provided in the codebase and a
 client can further be generated. The proxy API can also be used instead of the traditional
 ZooKeeperInstance class to provide a single TCP port in which clients can be securely routed
 through a firewall, without requiring access to all tablet servers in the cluster.
 
-### Prerequisites
+## Prerequisites
 
 The proxy server can live on any node in which the basic client API would work. That means it must
 be able to communicate with the Master, ZooKeepers, NameNode, and the DataNodes. A proxy client only
 needs the ability to communicate with the proxy server.
 
-### Configuration
+## Configuration
 
 The configuration options for the proxy server live inside of a properties file. At the very least,
 you need to supply the following properties:
@@ -34,7 +32,7 @@ You can find a sample configuration file in your distribution at `proxy/proxy.pr
 This sample configuration file further demonstrates an ability to back the proxy server by MockAccumulo
 or the MiniAccumuloCluster.
 
-### Running the Proxy Server
+## Running the Proxy Server
 
 After the properties file holding the configuration is created, the proxy server can be started using the
 following command in the Accumulo distribution (assuming
@@ -42,7 +40,7 @@ your properties file is named `config.properties`):
 
     accumulo proxy -p config.properties
 
-### Creating a Proxy Client
+## Creating a Proxy Client
 
 Aside from installing the Thrift compiler, you will also need the language-specific library
 for Thrift installed to generate client code in that language. Typically, your operating
@@ -54,7 +52,7 @@ You can find the thrift file for generating the client at `proxy/proxy.thrift`.
 After a client is generated, the port specified in the configuration properties above will be
 used to connect to the server.
 
-### Using a Proxy Client
+## Using a Proxy Client
 
 The following examples have been written in Java and the method signatures may be
 slightly different depending on the language specified when generating client with
