ACCUMULO-4531 Remove bulkIngest.html

* Similar documentation exists in 'High-speed Ingest' section of user manual
Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/c0e68a3a
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/c0e68a3a
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/c0e68a3a

Branch: refs/heads/master
Commit: c0e68a3a82bb6fa7eb370039537aac8337f22e47
Parents: 43e98d4
Author: Mike Walch <mwa...@apache.org>
Authored: Tue Dec 6 10:58:23 2016 -0500
Committer: Mike Walch <mwa...@apache.org>
Committed: Tue Dec 6 14:04:54 2016 -0500

----------------------------------------------------------------------
 docs/src/main/resources/bulkIngest.html | 114 ---------------------------
 1 file changed, 114 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/c0e68a3a/docs/src/main/resources/bulkIngest.html
----------------------------------------------------------------------
diff --git a/docs/src/main/resources/bulkIngest.html b/docs/src/main/resources/bulkIngest.html
deleted file mode 100644
index 9e9896e..0000000
--- a/docs/src/main/resources/bulkIngest.html
+++ /dev/null
@@ -1,114 +0,0 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one or more
-  contributor license agreements.  See the NOTICE file distributed with
-  this work for additional information regarding copyright ownership.
-  The ASF licenses this file to You under the Apache License, Version 2.0
-  (the "License"); you may not use this file except in compliance with
-  the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License.
--->
-<html>
-<head>
-<title>Accumulo Bulk Ingest</title>
-<link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/>
-</head>
-<body>
-
-<h1>Apache Accumulo Documentation : Bulk Ingest</h1>
-
-<p>Accumulo supports importing sorted files produced by an
-external process into an online table. Often, it is much faster to churn
-through large amounts of data using map/reduce to produce these files.
-The new files can be incorporated into Accumulo using bulk ingest,
-as follows (a Java sketch appears after this list):
-
-<ul>
-<li>Construct an <code>org.apache.accumulo.core.client.Connector</code> instance</li>
-<li>Call <code>connector.tableOperations().getSplits()</code></li>
-<li>Run a map/reduce job using <code>RangePartitioner</code>
-with splits from the previous step</li>
-<li>Call <code>connector.tableOperations().importDirectory()</code>, passing the output directory of the map/reduce job</li>
-</ul>
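A minimal Java sketch of the sequence above, assuming an Accumulo 1.x client
on the classpath; the instance name, ZooKeeper host, credentials, table name,
and HDFS paths are all placeholders (note that later releases rename
getSplits() to listSplits()):

import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class BulkIngestExample {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details.
    Connector connector = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("user", new PasswordToken("password"));

    // Fetch the table's current split points; these seed the RangePartitioner
    // used by the map/reduce job (see the job sketch at the end of this page).
    Collection<Text> splits = connector.tableOperations().getSplits("mytable");

    // (Not shown) Write the splits to a file in HDFS and run the map/reduce
    // job, producing sorted files under /tmp/bulk/files.

    // Bring the new files online; files that cannot be imported are moved to
    // the failures directory. 'false' keeps the timestamps from the files.
    connector.tableOperations().importDirectory("mytable", "/tmp/bulk/files",
        "/tmp/bulk/failures", false);
  }
}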
-
-<p>Files can also be imported using the "importdirectory" shell command.
-
-<p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>.
-
-<p>Importing whole files of sorted data can be very efficient, but it differs
-from live ingest in the following ways:
-<ul>
-  <li>Table constraints are not applied to the data in the files.
-  <li>Adding new files to tables is likely to trigger major compactions.
-  <li>The timestamps in the files could contain strange values. Accumulo can be asked to use the ingest timestamp for all values if this is a concern.
-  <li>It is possible to create invalid visibility values (for example "&|"). This will cause errors when the data is accessed.
-  <li>Bulk imports do not affect the entry counts in the monitor page until the files are compacted.
-</ul>
-
-<h2>Best Practices</h2>
-
-<p>Consider two approaches to creating ingest files using map/reduce:
-
-<ol>
-  <li>A large file containing the Key/Value pairs for only a single tablet.
-  <li>A set of small files containing Key/Value pairs for every tablet.
-</ol>
-
-<p>In the first case, adding the file requires telling a single tablet server about a single file. Even if the file
-is 20G in size, it is one call to the tablet server. The tablet server makes one extra file entry in the
-tablet's metadata, and the data is now part of the tablet.
-
-<p>In the second case, a request must be made for each tablet for each file to be added. If there are
-100 files and 100 tablets, this will be 10K requests, and the number of files that need to be opened
-for scans on these tablets will be very large. Major compactions will most likely start, which will eventually
-fix the problem, but a lot more work must be done by Accumulo to read these files.
-
-<p>Getting good, fast bulk import performance depends on creating files like the first case and avoiding files like
-the second.
-
-<p>For this reason, a <code>RangePartitioner</code> should be used to create files when
-writing with <code>AccumuloFileOutputFormat</code> (a job-wiring sketch appears at the end of this page).
-
-<p>Hash partitioning is not recommended because it will put keys in random
-groups, exactly like the second approach above.
-
-<p>Any set of cut points for range partitioning can be used in a map/reduce
-job, but using Accumulo's current splits is usually the best choice. However, in some cases there may be too many
-splits. For example, if there are 2000 splits, you would need to run
-2001 reducers. To overcome this problem, use the
-<code>connector.tableOperations().getSplits(&lt;table name&gt;, &lt;max splits&gt;)</code>
-method. This method will not return more than
-<code>&lt;max splits&gt;</code> splits, but the splits it returns
-will optimally partition the data for Accumulo.
-
-<p>Remember that Accumulo never splits rows across tablets.
-Therefore the range partitioner only considers rows when partitioning.
-
-<p>When bulk importing many files into a new table, it might be good to pre-split the table to bring
-additional resources to accepting the data. For example, if you know your data is indexed by
-date, pre-creating splits for each day will allow files to fall into natural splits. Having more tablets
-accept the new data means that more resources can be used to import the data right away.
-
-<p>An alternative to bulk ingest is to have a map/reduce job use
-<code>AccumuloOutputFormat</code>, which can support billions of inserts per
-hour, depending on the size of your cluster. This is sufficient for
-most users, but bulk ingest remains the fastest way to incorporate
-data into Accumulo. In addition, bulk ingest has one advantage over
-<code>AccumuloOutputFormat</code>: there is no duplicate data insertion. When one uses
-map/reduce to output data to Accumulo, restarted jobs may re-enter
-data from previous failed attempts. Generally, this only matters when
-there are aggregators. With bulk ingest, reducers write to new
-map files, so it does not matter: if a reducer fails, you create a new
-map file. When all reducers finish, you bulk ingest the map files
-into Accumulo. The disadvantage of bulk ingest compared to <code>AccumuloOutputFormat</code> is
-greater latency: the entire map/reduce job must complete
-before any data is available.
-
-</body>
-</html>
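For reference, a sketch of the file-creation job described above, pairing
RangePartitioner with AccumuloFileOutputFormat. The mapper and reducer, the
tab-separated input format, column names, paths, and reducer count are all
illustrative; the project's complete example is the README.bulkIngest linked
above.

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CreateBulkFiles {

  // Illustrative mapper: reads tab-separated "row<TAB>data" lines and emits
  // them keyed by row, so the partitioner can route each row to its tablet.
  public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2)
        ctx.write(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // Illustrative reducer: emits Key/Value pairs in sorted order. A counter in
  // the column qualifier keeps keys within a row strictly increasing.
  public static class SortReducer extends Reducer<Text, Text, Key, Value> {
    @Override
    protected void reduce(Text row, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      int i = 0;
      for (Text v : values) {
        Key key = new Key(row, new Text("cf"), new Text(String.format("%08d", i++)));
        ctx.write(key, new Value(v.copyBytes()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "create bulk ingest files");
    job.setJarByClass(CreateBulkFiles.class);
    job.setMapperClass(ParseMapper.class);
    job.setReducerClass(SortReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    FileInputFormat.addInputPath(job, new Path("/tmp/bulk/input"));

    // Partition map output by the table's split points so each reducer
    // produces one file covering a single tablet's range.
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, "/tmp/bulk/splits.txt");
    job.setNumReduceTasks(101); // number of split points + 1

    // Emit sorted Accumulo files that importDirectory() can bring online.
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}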