mikewalch closed pull request #140: Updated MapReduce docs with 2.0 changes
URL: https://github.com/apache/accumulo-website/pull/140
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/_docs-2/development/high_speed_ingest.md b/_docs-2/development/high_speed_ingest.md
index ecf458b9..46fee585 100644
--- a/_docs-2/development/high_speed_ingest.md
+++ b/_docs-2/development/high_speed_ingest.md
@@ -112,7 +112,7 @@ on how use to use MapReduce with Accumulo, see the [MapReduce documentation][map
 and the [MapReduce example code][mapred-code].
 
 [bulk-example]: https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [mapred-docs]: {% durl development/mapreduce %}
 [mapred-code]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
diff --git a/_docs-2/development/mapreduce.md b/_docs-2/development/mapreduce.md
index ee6a93ae..7687ae86 100644
--- a/_docs-2/development/mapreduce.md
+++ b/_docs-2/development/mapreduce.md
@@ -4,18 +4,11 @@ category: development
 order: 2
 ---
 
-Accumulo tables can be used as the source and destination of MapReduce jobs. To
-use an Accumulo table with a MapReduce job, configure the job parameters to use
-the [AccumuloInputFormat] and [AccumuloOutputFormat]. Accumulo specific parameters
-can be set via these two format classes to do the following:
+Accumulo tables can be used as the source and destination of MapReduce jobs.
 
-* Authenticate and provide user credentials for the input
-* Restrict the scan to a range of rows
-* Restrict the input to a subset of available columns
+## General MapReduce configuration
 
-## Configuration
-
-Since 2.0.0, Accumulo no longer has the same versions of dependencies (i.e Guava, etc) as Hadoop.
+Since 2.0.0, Accumulo no longer has the same dependency versions (i.e Guava, etc) as Hadoop.
 When launching a MapReduce job that reads or writes to Accumulo, you should build a shaded jar
 with all of your dependencies and complete the following steps so YARN only includes Hadoop code
 (and not all of Hadoop dependencies) when running your MapReduce job:
@@ -28,163 +21,101 @@ with all of your dependencies and complete the following steps so YARN only incl
     job.getConfiguration().set("mapreduce.job.classloader", "true");
     ```
 
-## Mapper and Reducer classes
+## Read input from an Accumulo table
 
-To read from an Accumulo table create a Mapper with the following class
-parameterization and be sure to configure the [AccumuloInputFormat].
+Follow the steps below to create a MapReduce job that reads from an Accumulo table:
 
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        // transform key and value data here
-    }
-}
-```
-
-To write to an Accumulo table, create a Reducer with the following class
-parameterization and be sure to configure the [AccumuloOutputFormat]. The key
-emitted from the Reducer identifies the table to which the mutation is sent. This
-allows a single Reducer to write to more than one table if desired. A default table
-can be configured using the AccumuloOutputFormat, in which case the output table
-name does not have to be passed to the Context object within the Reducer.
-
-```java
-class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
-    public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
-        Mutation m;
-        // create the mutation based on input key and value
-        c.write(new Text("output-table"), m);
+1. Create a Mapper with the following class parameterization.
+
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            // transform key and value data here
+        }
     }
-}
-```
+    ```
 
-The Text object passed as the output should contain the name of the table to which
-this mutation should be applied. The Text can be null in which case the mutation
-will be applied to the default table name specified in the [AccumuloOutputFormat]
-options.
-
-## AccumuloInputFormat options
-
-The following code shows how to set up Accumulo
-
-```java
-Job job = new Job(getConf());
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-        .as("user", "passwd").info()
-AccumuloInputFormat.setClientInfo(job, info);
-AccumuloInputFormat.setInputTableName(job, table);
-AccumuloInputFormat.setScanAuthorizations(job, new Authorizations());
-```
-
-**Optional Settings:**
-
-To restrict Accumulo to a set of row ranges:
-
-```java
-ArrayList<Range> ranges = new ArrayList<Range>();
-// populate array list of row ranges ...
-AccumuloInputFormat.setRanges(job, ranges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
-// populate list of columns
-AccumuloInputFormat.fetchColumns(job, columns);
-```
-
-To use a regular expression to match row IDs:
-
-```java
-IteratorSetting is = new IteratorSetting(30, RexExFilter.class);
-RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
-AccumuloInputFormat.addIterator(job, is);
-```
-
-## AccumuloMultiTableInputFormat options
-
-The [AccumuloMultiTableInputFormat] allows the scanning over multiple tables
-in a single MapReduce job. Separate ranges, columns, and iterators can be
-used for each table.
-
-```java
-InputTableConfig tableOneConfig = new InputTableConfig();
-InputTableConfig tableTwoConfig = new InputTableConfig();
-```
-
-To set the configuration objects on the job:
-
-```java
-Map<String, InputTableConfig> configs = new HashMap<String,InputTableConfig>();
-configs.put("table1", tableOneConfig);
-configs.put("table2", tableTwoConfig);
-AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);
-```
-
-**Optional settings:**
-
-To restrict to a set of ranges:
-
-```java
-ArrayList<Range> tableOneRanges = new ArrayList<Range>();
-ArrayList<Range> tableTwoRanges = new ArrayList<Range>();
-// populate array lists of row ranges for tables...
-tableOneConfig.setRanges(tableOneRanges);
-tableTwoConfig.setRanges(tableTwoRanges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>();
-ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>();
-// populate lists of columns for each of the tables ...
-tableOneConfig.fetchColumns(tableOneColumns);
-tableTwoConfig.fetchColumns(tableTwoColumns);
-```
-
-To set scan iterators:
-
-```java
-List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>();
-List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>();
-// populate the lists of iterator settings for each of the tables ...
-tableOneConfig.setIterators(tableOneIterators);
-tableTwoConfig.setIterators(tableTwoIterators);
-```
-
-The name of the table can be retrieved from the input split:
-
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
-        String tableName = split.getTableName();
-        // do something with table name
+2. Configure your MapReduce job to use [AccumuloInputFormat].
+
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+        .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props).table(table).store(job);
+    ```
+    [AccumuloInputFormat] has optional settings.
+    ```java
+    List<Range> ranges = new ArrayList<Range>();
+    List<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
+    // populate ranges & columns
+    IteratorSetting is = new IteratorSetting(30, RexExFilter.class);
+    RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
+
+    AccumuloInputFormat.configure().clientProperties(props).table(table)
+        .auths(Authorizations.EMPTY) // optional: default to user's auths if not set
+        .ranges(ranges) // optional: only read specified ranges
+        .fetchColumns(columns) // optional: only read specified columns
+        .addIterator(is) // optional: add iterator that matches row IDs
+        .store(job);
+    ```
+    [AccumuloInputFormat] can also be configured to read from multiple Accumulo tables.
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+        .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props)
+        .table("table1").auths(Authorizations.EMPTY).ranges(tableOneRanges)
+        .table("table2").auths(Authorizations.EMPTY).ranges(tableTwoRanges)
+        .store(job);
+    ```
+    If reading from multiple tables, the table name can be retrieved from the input split:
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
+            String tableName = split.getTableName();
+            // do something with table name
+        }
     }
-}
-```
+    ```
 
-## AccumuloOutputFormat options
+## Write output to an Accumulo table
 
-```java
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-        .as("user", "passwd").info()
-AccumuloOutputFormat.setClientInfo(job, info);
-AccumuloOutputFormat.setDefaultTableName(job, "mytable");
-```
+Follow the steps below to write to an Accumulo table from a MapReduce job.
 
-**Optional Settings:**
+1. Create a Reducer with the following class parameterization. The key emitted from
+    the Reducer identifies the table to which the mutation is sent. This allows a single
+    Reducer to write to more than one table if desired. A default table can be configured
+    using the [AccumuloOutputFormat], in which case the output table name does not have to
+    be passed to the Context object within the Reducer.
+    ```java
+    class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
+        public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
+            Mutation m;
+            // create the mutation based on input key and value
+            c.write(new Text("output-table"), m);
+        }
+    }
+    ```
+    The Text object passed as the output should contain the name of the table to which
+    this mutation should be applied. The Text can be null in which case the mutation
+    will be applied to the default table name specified in the [AccumuloOutputFormat]
+    options.
 
-```java
-AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds
-AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes
-```
+2. Configure your MapReduce job to use [AccumuloOutputFormat].
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setOutputFormatClass(AccumuloOutputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+        .as("user", "passwd").build();
+    AccumuloOutputFormat.configure().clientProperties(props)
+        .defaultTable("mytable").store(job);
+    ```
 
 The [MapReduce example][mapred-example] contains a complete example of using MapReduce with Accumulo.
 
 [mapred-example]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloMultiTableInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloMultiTableInputFormat %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
diff --git a/_docs-2/development/sampling.md b/_docs-2/development/sampling.md
index cde4642c..4d586d3c 100644
--- a/_docs-2/development/sampling.md
+++ b/_docs-2/development/sampling.md
@@ -52,8 +52,8 @@ Sample data can also be scanned from within an Accumulo [SortedKeyValueIterator]
 To see how to do this, look at the example iterator referenced in the
 [sampling example][example]. Also, consult the javadoc on
 [IteratorEnvironment.cloneWithSamplingEnabled()][clone-sampling].
 
-Map reduce jobs using the [AccumuloInputFormat] can also read sample data. See
-the javadoc for the `setSamplerConfiguration()` method of [AccumuloInputFormat].
+MapReduce jobs using the [AccumuloInputFormat] can also read sample data. See the javadoc
+for `samplerConfiguration()` in the `configure()` method of [AccumuloInputFormat].
 
 Scans over sample data will throw a [SampleNotPresentException] in the following cases :
@@ -67,7 +67,7 @@ generated with the same configuration.
 ## Bulk import
 
 When generating rfiles to bulk import into Accumulo, those rfiles can contain
-sample data. To use this feature, look at the javadoc of the `setSampler(...)`
+sample data. To use this feature, look at the javadoc of `sampler()` in the `configure()`
 method of [AccumuloFileOutputFormat].
 
 [example]: https://github.com/apache/accumulo-examples/blob/master/docs/sample.md
@@ -75,8 +75,8 @@ method of [AccumuloFileOutputFormat].
 [sample-package]: {% jurl org.apache.accumulo.core.client.sample %}
 [skv-iterator]: {% jurl org.apache.accumulo.core.iterators.SortedKeyValueIterator %}
 [clone-sampling]: {% jurl org.apache.accumulo.core.iterators.IteratorEnvironment#cloneWithSamplingEnabled-- %}
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [SampleNotPresentException]: {% jurl org.apache.accumulo.core.client.SampleNotPresentException %}
 [BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %}
 [Scanner]: {% jurl org.apache.accumulo.core.client.Scanner %}
diff --git a/_docs-2/development/summaries.md b/_docs-2/development/summaries.md
index d68a570e..40f6c1e6 100644
--- a/_docs-2/development/summaries.md
+++ b/_docs-2/development/summaries.md
@@ -63,8 +63,8 @@ requires a special permission. User must have the table permission
 ## Bulk import
 
 When generating RFiles to bulk import into Accumulo, those RFiles can contain
-summary data. To use this feature, look at the javadoc on the
-`AccumuloFileOutputFormat.setSummarizers(...)` method. Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
+summary data. To use this feature, look at the javadoc of `summarizers()` in the `configure()` method
+of AccumuloFileOutputFormat. Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
 class has options for creating RFiles with embedded summary data.
 
 ## Examples
@@ -218,3 +218,4 @@ root@uno summary_test> summaries
 root@uno summary_test>
 ```
 
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
diff --git a/_docs-2/security/kerberos.md b/_docs-2/security/kerberos.md
index 716f630b..2535935b 100644
--- a/_docs-2/security/kerberos.md
+++ b/_docs-2/security/kerberos.md
@@ -390,14 +390,14 @@ KerberosToken kt = new KerberosToken();
 AccumuloClient client = Accumulo.newClient().to("myinstance", "zoo1,zoo2")
         .as(principal, kt).build();
 DelegationToken dt = client.securityOperations().getDelegationToken();
-AccumuloClient client2 = client.changeUser(principal, dt);
-ClientInfo info2 = client2.info();
+Properties props = Accumulo.newClientProperties().from(client.properties())
+        .as(principal, dt).build();
 
 // Reading from Accumulo
-AccumuloInputFormat.setClientInfo(job, info2);
+AccumuloInputFormat.configure().clientProperties(props).store(job);
 
 // Writing to Accumulo
-AccumuloOutputFormat.setClientInfo(job, info2);
+AccumuloOutputFormat.configure().clientProperties(props).store(job);
 ```
 
 Users must have the `DELEGATION_TOKEN` system permission to call the `getDelegationToken`
diff --git a/_docs-2/security/on-disk-encryption.md b/_docs-2/security/on-disk-encryption.md
index e7be37bf..70467677 100644
--- a/_docs-2/security/on-disk-encryption.md
+++ b/_docs-2/security/on-disk-encryption.md
@@ -78,8 +78,8 @@ its the additional data that gets encrypted on disk that could be exposed in a l
 
 ### Bulk Import
 
-There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloOutputFormat].
-The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above. The [AccumuloOutputFormat] does
+There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloFileOutputFormat].
+The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above. The [AccumuloFileOutputFormat] does
 not allow for encryption of RFiles so any data bulk imported through this process will be unencrypted.
 
 ### Zookeeper
@@ -104,4 +104,4 @@ As you can see, there is a significant performance hit when running without the
 [Kerberos]: {% durl security/kerberos %}
 [design]: {% durl getting-started/design#rfile %}
 [rfile]: {% jurl org.apache.accumulo.core.client.rfile.RFile %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
diff --git a/_plugins/links.rb b/_plugins/links.rb
index 2f9dc3ff..f2278901 100755
--- a/_plugins/links.rb
+++ b/_plugins/links.rb
@@ -43,8 +43,8 @@ def render_javadoc(context, text, url_only)
     jmodule = 'accumulo-' + clz.split('.')[3]
     if clz.start_with?('org.apache.accumulo.server')
       jmodule = 'accumulo-server-base'
-    elsif clz.start_with?('org.apache.accumulo.core.client.mapred')
-      jmodule = 'accumulo-client-mapreduce'
+    elsif clz.start_with?('org.apache.accumulo.hadoop.mapred')
+      jmodule = 'accumulo-hadoop-mapreduce'
     elsif clz.start_with?('org.apache.accumulo.iteratortest')
       jmodule = 'accumulo-iterator-test-harness'
     end

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
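
For readers checking the `_plugins/links.rb` hunk, here is a minimal standalone sketch of the module-selection branches it shows after the change. The helper name `javadoc_module` is hypothetical (the plugin does this inline inside `render_javadoc`), and only the branches visible in the diff are reproduced; the rest of the plugin is not shown here and is not assumed.

```ruby
# Hypothetical helper, not part of the plugin: mirrors only the
# branches visible in the links.rb hunk of this pull request.
def javadoc_module(clz)
  # Default: 'accumulo-' plus the fourth dot-separated segment of the class name.
  jmodule = 'accumulo-' + clz.split('.')[3]
  if clz.start_with?('org.apache.accumulo.server')
    jmodule = 'accumulo-server-base'
  elsif clz.start_with?('org.apache.accumulo.hadoop.mapred')
    # This prefix also matches the 'org.apache.accumulo.hadoop.mapreduce' package.
    jmodule = 'accumulo-hadoop-mapreduce'
  elsif clz.start_with?('org.apache.accumulo.iteratortest')
    jmodule = 'accumulo-iterator-test-harness'
  end
  jmodule
end

puts javadoc_module('org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat')
puts javadoc_module('org.apache.accumulo.core.client.rfile.RFile')
```

Without the `hadoop.mapred` branch, the default rule would yield `accumulo-hadoop` for the new package, which is presumably why the branch was updated rather than dropped.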
