[GitHub] mikewalch closed pull request #140: Updated MapReduce docs with 2.0 changes

GitBox Fri, 04 Jan 2019 06:23:48 -0800

mikewalch closed pull request #140: Updated MapReduce docs with 2.0 changes
URL: https://github.com/apache/accumulo-website/pull/140


This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/_docs-2/development/high_speed_ingest.md b/_docs-2/development/high_speed_ingest.md
index ecf458b9..46fee585 100644
--- a/_docs-2/development/high_speed_ingest.md
+++ b/_docs-2/development/high_speed_ingest.md
@@ -112,7 +112,7 @@ on how use to use MapReduce with Accumulo, see the [MapReduce documentation][map
 and the [MapReduce example code][mapred-code].
 
 [bulk-example]: https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [mapred-docs]: {% durl development/mapreduce %}
 [mapred-code]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
diff --git a/_docs-2/development/mapreduce.md b/_docs-2/development/mapreduce.md
index ee6a93ae..7687ae86 100644
--- a/_docs-2/development/mapreduce.md
+++ b/_docs-2/development/mapreduce.md
@@ -4,18 +4,11 @@ category: development
 order: 2
 ---
 
-Accumulo tables can be used as the source and destination of MapReduce jobs. To
-use an Accumulo table with a MapReduce job, configure the job parameters to use
-the [AccumuloInputFormat] and [AccumuloOutputFormat]. Accumulo specific parameters
-can be set via these two format classes to do the following:
+Accumulo tables can be used as the source and destination of MapReduce jobs.
 
-* Authenticate and provide user credentials for the input
-* Restrict the scan to a range of rows
-* Restrict the input to a subset of available columns
+## General MapReduce configuration
 
-## Configuration
-
-Since 2.0.0, Accumulo no longer has the same versions of dependencies (i.e Guava, etc) as Hadoop.
+Since 2.0.0, Accumulo no longer has the same dependency versions (i.e Guava, etc) as Hadoop.
 When launching a MapReduce job that reads or writes to Accumulo, you should build a shaded jar
 with all of your dependencies and complete the following steps so YARN only includes Hadoop code
 (and not all of Hadoop dependencies) when running your MapReduce job:
@@ -28,163 +21,101 @@ with all of your dependencies and complete the following steps so YARN only incl
     job.getConfiguration().set("mapreduce.job.classloader", "true");
     ```
 
-## Mapper and Reducer classes
+## Read input from an Accumulo table
 
-To read from an Accumulo table create a Mapper with the following class
-parameterization and be sure to configure the [AccumuloInputFormat].
+Follow the steps below to create a MapReduce job that reads from an Accumulo table:
 
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        // transform key and value data here
-    }
-}
-```
-
-To write to an Accumulo table, create a Reducer with the following class
-parameterization and be sure to configure the [AccumuloOutputFormat]. The key
-emitted from the Reducer identifies the table to which the mutation is sent. This
-allows a single Reducer to write to more than one table if desired. A default table
-can be configured using the AccumuloOutputFormat, in which case the output table
-name does not have to be passed to the Context object within the Reducer.
-
-```java
-class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
-    public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
-        Mutation m;
-        // create the mutation based on input key and value
-        c.write(new Text("output-table"), m);
+1. Create a Mapper with the following class parameterization.
+
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            // transform key and value data here
+        }
     }
-}
-```
+    ```
 
-The Text object passed as the output should contain the name of the table to which
-this mutation should be applied. The Text can be null in which case the mutation
-will be applied to the default table name specified in the [AccumuloOutputFormat]
-options.
-
-## AccumuloInputFormat options
-
-The following code shows how to set up Accumulo
-
-```java
-Job job = new Job(getConf());
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-                        .as("user", "passwd").info()
-AccumuloInputFormat.setClientInfo(job, info);
-AccumuloInputFormat.setInputTableName(job, table);
-AccumuloInputFormat.setScanAuthorizations(job, new Authorizations());
-```
-
-**Optional Settings:**
-
-To restrict Accumulo to a set of row ranges:
-
-```java
-ArrayList<Range> ranges = new ArrayList<Range>();
-// populate array list of row ranges ...
-AccumuloInputFormat.setRanges(job, ranges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
-// populate list of columns
-AccumuloInputFormat.fetchColumns(job, columns);
-```
-
-To use a regular expression to match row IDs:
-
-```java
-IteratorSetting is = new IteratorSetting(30, RexExFilter.class);
-RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
-AccumuloInputFormat.addIterator(job, is);
-```
-
-## AccumuloMultiTableInputFormat options
-
-The [AccumuloMultiTableInputFormat] allows the scanning over multiple tables
-in a single MapReduce job. Separate ranges, columns, and iterators can be
-used for each table.
-
-```java
-InputTableConfig tableOneConfig = new InputTableConfig();
-InputTableConfig tableTwoConfig = new InputTableConfig();
-```
-
-To set the configuration objects on the job:
-
-```java
-Map<String, InputTableConfig> configs = new HashMap<String,InputTableConfig>();
-configs.put("table1", tableOneConfig);
-configs.put("table2", tableTwoConfig);
-AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);
-```
-
-**Optional settings:**
-
-To restrict to a set of ranges:
-
-```java
-ArrayList<Range> tableOneRanges = new ArrayList<Range>();
-ArrayList<Range> tableTwoRanges = new ArrayList<Range>();
-// populate array lists of row ranges for tables...
-tableOneConfig.setRanges(tableOneRanges);
-tableTwoConfig.setRanges(tableTwoRanges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>();
-ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>();
-// populate lists of columns for each of the tables ...
-tableOneConfig.fetchColumns(tableOneColumns);
-tableTwoConfig.fetchColumns(tableTwoColumns);
-```
-
-To set scan iterators:
-
-```java
-List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>();
-List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>();
-// populate the lists of iterator settings for each of the tables ...
-tableOneConfig.setIterators(tableOneIterators);
-tableTwoConfig.setIterators(tableTwoIterators);
-```
-
-The name of the table can be retrieved from the input split:
-
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
-        String tableName = split.getTableName();
-        // do something with table name
+2. Configure your MapReduce job to use [AccumuloInputFormat].
+
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props).table(table).store(job);
+    ```
+    [AccumuloInputFormat] has optional settings.
+    ```java
+    List<Range> ranges = new ArrayList<Range>();
+    List<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
+    // populate ranges & columns
+    IteratorSetting is = new IteratorSetting(30, RegExFilter.class);
+    RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
+
+    AccumuloInputFormat.configure().clientProperties(props).table(table)
+        .auths(Authorizations.EMPTY) // optional: default to user's auths if not set
+        .ranges(ranges)              // optional: only read specified ranges
+        .fetchColumns(columns)       // optional: only read specified columns
+        .addIterator(is)             // optional: add iterator that matches row IDs
+        .store(job);
+    ```
+    [AccumuloInputFormat] can also be configured to read from multiple Accumulo tables.
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props)
+        .table("table1").auths(Authorizations.EMPTY).ranges(tableOneRanges)
+        .table("table2").auths(Authorizations.EMPTY).ranges(tableTwoRanges)
+        .store(job);
+    ```
+    If reading from multiple tables, the table name can be retrieved from the input split:
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
+            String tableName = split.getTableName();
+            // do something with table name
+        }
     }
-}
-```
+    ```
 
-## AccumuloOutputFormat options
+## Write output to an Accumulo table
 
-```java
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-                        .as("user", "passwd").info()
-AccumuloOutputFormat.setClientInfo(job, info);
-AccumuloOutputFormat.setDefaultTableName(job, "mytable");
-```
+Follow the steps below to write to an Accumulo table from a MapReduce job.
 
-**Optional Settings:**
+1. Create a Reducer with the following class parameterization. The key emitted from
+    the Reducer identifies the table to which the mutation is sent. This allows a single
+    Reducer to write to more than one table if desired. A default table can be configured
+    using the [AccumuloOutputFormat], in which case the output table name does not have to
+    be passed to the Context object within the Reducer.
+    ```java
+    class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
+        public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
+            Mutation m;
+            // create the mutation based on input key and value
+            c.write(new Text("output-table"), m);
+        }
+    }
+    ```
+    The Text object passed as the output should contain the name of the table to which
+    this mutation should be applied. The Text can be null in which case the mutation
+    will be applied to the default table name specified in the [AccumuloOutputFormat]
+    options.
 
-```java
-AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds
-AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes
-```
+2. Configure your MapReduce job to use [AccumuloOutputFormat].
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setOutputFormatClass(AccumuloOutputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloOutputFormat.configure().clientProperties(props)
+        .defaultTable("mytable").store(job);
+    ```
 
 The [MapReduce example][mapred-example] contains a complete example of using MapReduce with Accumulo.
 
 [mapred-example]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloMultiTableInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloMultiTableInputFormat %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
diff --git a/_docs-2/development/sampling.md b/_docs-2/development/sampling.md
index cde4642c..4d586d3c 100644
--- a/_docs-2/development/sampling.md
+++ b/_docs-2/development/sampling.md
@@ -52,8 +52,8 @@ Sample data can also be scanned from within an Accumulo [SortedKeyValueIterator]
 To see how to do this, look at the example iterator referenced in the [sampling example][example].
 Also, consult the javadoc on [IteratorEnvironment.cloneWithSamplingEnabled()][clone-sampling].
 
-Map reduce jobs using the [AccumuloInputFormat] can also read sample data.  See
-the javadoc for the `setSamplerConfiguration()` method of [AccumuloInputFormat].
+MapReduce jobs using the [AccumuloInputFormat] can also read sample data.  See the javadoc
+for `samplerConfiguration()` in the `configure()` method of [AccumuloInputFormat].
 
 Scans over sample data will throw a [SampleNotPresentException] in the following cases :
 
@@ -67,7 +67,7 @@ generated with the same configuration.
 ## Bulk import
 
 When generating rfiles to bulk import into Accumulo, those rfiles can contain
-sample data.  To use this feature, look at the javadoc of the `setSampler(...)`
+sample data.  To use this feature, look at the javadoc of `sampler()` in the `configure()`
 method of [AccumuloFileOutputFormat].
 
 [example]: https://github.com/apache/accumulo-examples/blob/master/docs/sample.md
@@ -75,8 +75,8 @@ method of [AccumuloFileOutputFormat].
 [sample-package]: {% jurl org.apache.accumulo.core.client.sample %}
 [skv-iterator]: {% jurl org.apache.accumulo.core.iterators.SortedKeyValueIterator %}
 [clone-sampling]: {% jurl org.apache.accumulo.core.iterators.IteratorEnvironment#cloneWithSamplingEnabled-- %}
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [SampleNotPresentException]: {% jurl org.apache.accumulo.core.client.SampleNotPresentException %}
 [BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %}
 [Scanner]: {% jurl org.apache.accumulo.core.client.Scanner %}
diff --git a/_docs-2/development/summaries.md b/_docs-2/development/summaries.md
index d68a570e..40f6c1e6 100644
--- a/_docs-2/development/summaries.md
+++ b/_docs-2/development/summaries.md
@@ -63,8 +63,8 @@ requires a special permission.  User must have the table permission
 ## Bulk import
 
 When generating RFiles to bulk import into Accumulo, those RFiles can contain
-summary data.  To use this feature, look at the javadoc on the
-`AccumuloFileOutputFormat.setSummarizers(...)` method.  Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
+summary data.  To use this feature, look at the javadoc of `summarizers()` in the `configure()` method
+of AccumuloFileOutputFormat.  Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
 class has options for creating RFiles with embedded summary data.
 
 ## Examples
@@ -218,3 +218,4 @@ root@uno summary_test> summaries
 root@uno summary_test>   
 ```
 
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
diff --git a/_docs-2/security/kerberos.md b/_docs-2/security/kerberos.md
index 716f630b..2535935b 100644
--- a/_docs-2/security/kerberos.md
+++ b/_docs-2/security/kerberos.md
@@ -390,14 +390,14 @@ KerberosToken kt = new KerberosToken();
 AccumuloClient client = Accumulo.newClient().to("myinstance", "zoo1,zoo2")
                           .as(principal, kt).build();
 DelegationToken dt = client.securityOperations().getDelegationToken();
-AccumuloClient client2 = client.changeUser(principal, dt);
-ClientInfo info2 = client2.info();
+Properties props = Accumulo.newClientProperties().from(client.properties())
+                          .as(principal, dt).build();
 
 // Reading from Accumulo
-AccumuloInputFormat.setClientInfo(job, info2);
+AccumuloInputFormat.configure().clientProperties(props).store(job);
 
 // Writing to Accumulo
-AccumuloOutputFormat.setClientInfo(job, info2);
+AccumuloOutputFormat.configure().clientProperties(props).store(job);
 ```
 
 Users must have the `DELEGATION_TOKEN` system permission to call the `getDelegationToken`
diff --git a/_docs-2/security/on-disk-encryption.md b/_docs-2/security/on-disk-encryption.md
index e7be37bf..70467677 100644
--- a/_docs-2/security/on-disk-encryption.md
+++ b/_docs-2/security/on-disk-encryption.md
@@ -78,8 +78,8 @@ its the additional data that gets encrypted on disk that could be exposed in a l
 
 ### Bulk Import
 
-There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloOutputFormat].  
-The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above.  The [AccumuloOutputFormat] does 
+There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloFileOutputFormat].  
+The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above.  The [AccumuloFileOutputFormat] does 
 not allow for encryption of RFiles so any data bulk imported through this process will be unencrypted.
 
 ### Zookeeper
@@ -104,4 +104,4 @@ As you can see, there is a significant performance hit when running without the
 [Kerberos]: {% durl security/kerberos %}
 [design]: {% durl getting-started/design#rfile %}
 [rfile]: {% jurl org.apache.accumulo.core.client.rfile.RFile %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
diff --git a/_plugins/links.rb b/_plugins/links.rb
index 2f9dc3ff..f2278901 100755
--- a/_plugins/links.rb
+++ b/_plugins/links.rb
@@ -43,8 +43,8 @@ def render_javadoc(context, text, url_only)
   jmodule = 'accumulo-' + clz.split('.')[3]
   if clz.start_with?('org.apache.accumulo.server')
     jmodule = 'accumulo-server-base'
-  elsif clz.start_with?('org.apache.accumulo.core.client.mapred')
-    jmodule = 'accumulo-client-mapreduce'
+  elsif clz.start_with?('org.apache.accumulo.hadoop.mapred')
+    jmodule = 'accumulo-hadoop-mapreduce'
   elsif clz.start_with?('org.apache.accumulo.iteratortest')
     jmodule = 'accumulo-iterator-test-harness'
   end
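
For readers following the `_plugins/links.rb` hunk above, the module-selection branch can be sketched as a standalone Ruby function. This is only a sketch under stated assumptions: the helper name `javadoc_module` is hypothetical, and just the branch visible in the hunk is reproduced. The rest of `render_javadoc` (URL construction, the `url_only` handling) does not appear in the diff and is left out.

```ruby
# Hypothetical standalone helper (name is an assumption, not part of links.rb).
# It mirrors the post-change branch shown in the hunk: the javadoc module
# defaults to 'accumulo-' plus the fourth package segment, with a few
# prefix-based overrides.
def javadoc_module(clz)
  # 'org.apache.accumulo.core.client.rfile.RFile' -> segment 3 is 'core'
  jmodule = 'accumulo-' + clz.split('.')[3]
  if clz.start_with?('org.apache.accumulo.server')
    jmodule = 'accumulo-server-base'
  elsif clz.start_with?('org.apache.accumulo.hadoop.mapred')
    # Prefix match, so it covers the 'hadoop.mapreduce' package used in the docs
    jmodule = 'accumulo-hadoop-mapreduce'
  elsif clz.start_with?('org.apache.accumulo.iteratortest')
    jmodule = 'accumulo-iterator-test-harness'
  end
  jmodule
end

puts javadoc_module('org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat')
puts javadoc_module('org.apache.accumulo.core.client.rfile.RFile')
```

With the branch changed this way, a class name under the old `org.apache.accumulo.core.client.mapred` package would fall through to the default and resolve to `accumulo-core`, which is presumably why every `jurl` reference in the docs was moved to `org.apache.accumulo.hadoop.mapreduce` in the same pull request.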


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
