ACCUMULO-4531 Remove bulkIngest.html

* Similar documentation exists in 'High-speed Ingest' section of user manual
Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/c0e68a3a
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/c0e68a3a
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/c0e68a3a

Branch: refs/heads/master
Commit: c0e68a3a82bb6fa7eb370039537aac8337f22e47
Parents: 43e98d4
Author: Mike Walch <mwa...@apache.org>
Authored: Tue Dec 6 10:58:23 2016 -0500
Committer: Mike Walch <mwa...@apache.org>
Committed: Tue Dec 6 14:04:54 2016 -0500

----------------------------------------------------------------------
 docs/src/main/resources/bulkIngest.html | 114 ---------------------------
 1 file changed, 114 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/c0e68a3a/docs/src/main/resources/bulkIngest.html
----------------------------------------------------------------------
diff --git a/docs/src/main/resources/bulkIngest.html b/docs/src/main/resources/bulkIngest.html
deleted file mode 100644
index 9e9896e..0000000
--- a/docs/src/main/resources/bulkIngest.html
+++ /dev/null
@@ -1,114 +0,0 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one or more
-  contributor license agreements.  See the NOTICE file distributed with
-  this work for additional information regarding copyright ownership.
-  The ASF licenses this file to You under the Apache License, Version 2.0
-  (the "License"); you may not use this file except in compliance with
-  the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License.
--->
-<html>
-<head>
-<title>Accumulo Bulk Ingest</title>
-<link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/>
-</head>
-<body>
-
-<h1>Apache Accumulo Documentation : Bulk Ingest</h1>
-
-<p>Accumulo supports importing sorted files produced by an
-external process into an online table. Often, it is much faster to churn
-through large amounts of data using map/reduce to produce these files.
-The new files can be incorporated into Accumulo using bulk ingest,
-as follows (a Java sketch appears after this list):
-
-<ul>
-<li>Construct an <code>org.apache.accumulo.core.client.Connector</code> instance</li>
-<li>Call <code>connector.tableOperations().getSplits()</code></li>
-<li>Run a map/reduce job using <code>RangePartitioner</code>
-with splits from the previous step</li>
-<li>Call <code>connector.tableOperations().importDirectory()</code>, passing the output directory of the map/reduce job</li>
-</ul>
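A minimal Java sketch of the sequence above, assuming an Accumulo 1.x client
on the classpath; the instance name, ZooKeeper host, credentials, table name,
and HDFS paths are all placeholders (note that later releases rename
getSplits() to listSplits()):

import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class BulkIngestExample {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details.
    Connector connector = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("user", new PasswordToken("password"));

    // Fetch the table's current split points; these seed the RangePartitioner
    // used by the map/reduce job (see the job sketch at the end of this page).
    Collection<Text> splits = connector.tableOperations().getSplits("mytable");

    // (Not shown) Write the splits to a file in HDFS and run the map/reduce
    // job, producing sorted files under /tmp/bulk/files.

    // Bring the new files online; files that cannot be imported are moved to
    // the failures directory. 'false' keeps the timestamps from the files.
    connector.tableOperations().importDirectory("mytable", "/tmp/bulk/files",
        "/tmp/bulk/failures", false);
  }
}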
-
-<p>Files can also be imported using the "importdirectory" shell command.
-
-<p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>.
-
-<p>Importing whole files of sorted data can be very efficient, but it differs
-from live ingest in the following ways:
-<ul>
-  <li>Table constraints are not applied to the data in the files.
-  <li>Adding new files to tables is likely to trigger major compactions.
-  <li>The timestamps in the files could contain strange values. Accumulo can be asked to use the ingest timestamp for all values if this is a concern.
-  <li>It is possible to create invalid visibility values (for example "&|"). This will cause errors when the data is accessed.
-  <li>Bulk imports do not affect the entry counts in the monitor page until the files are compacted.
-</ul>
-
-<h2>Best Practices</h2>
-
-<p>Consider two approaches to creating ingest files using map/reduce:
-
-<ol>
-  <li>A large file containing the Key/Value pairs for only a single tablet.
-  <li>A set of small files containing Key/Value pairs for every tablet.
-</ol>
-
-<p>In the first case, adding the file requires telling a single tablet server about a single file. Even if the file
-is 20G in size, it is one call to the tablet server. The tablet server makes one extra file entry in the
-tablet's metadata, and the data is now part of the tablet.
-
-<p>In the second case, a request must be made for each tablet for each file to be added. If there are
-100 files and 100 tablets, this will be 10K requests, and the number of files that need to be opened
-for scans on these tablets will be very large. Major compactions will most likely start, which will eventually
-fix the problem, but a lot more work must be done by Accumulo to read these files.
-
-<p>Getting good, fast bulk import performance depends on creating files like the first case and avoiding files like
-the second.
-
-<p>For this reason, a <code>RangePartitioner</code> should be used to create files when
-writing with <code>AccumuloFileOutputFormat</code> (a job-wiring sketch appears at the end of this page).
-
-<p>Hash partitioning is not recommended because it will put keys in random
-groups, exactly like the second approach above.
-
-<p>Any set of cut points for range partitioning can be used in a map/reduce
-job, but using Accumulo's current splits is usually the best choice. However, in some cases there may be too many
-splits. For example, if there are 2000 splits, you would need to run
-2001 reducers. To overcome this problem, use the
-<code>connector.tableOperations().getSplits(&lt;table name&gt;, &lt;max splits&gt;)</code>
-method. This method will not return more than
-<code>&lt;max splits&gt;</code> splits, but the splits it returns
-will optimally partition the data for Accumulo.
-
-<p>Remember that Accumulo never splits rows across tablets.
-Therefore the range partitioner only considers rows when partitioning.
-
-<p>When bulk importing many files into a new table, it might be good to pre-split the table to bring
-additional resources to accepting the data. For example, if you know your data is indexed by
-date, pre-creating splits for each day will allow files to fall into natural splits. Having more tablets
-accept the new data means that more resources can be used to import the data right away.
-
-<p>An alternative to bulk ingest is to have a map/reduce job use
-<code>AccumuloOutputFormat</code>, which can support billions of inserts per
-hour, depending on the size of your cluster. This is sufficient for
-most users, but bulk ingest remains the fastest way to incorporate
-data into Accumulo. In addition, bulk ingest has one advantage over
-<code>AccumuloOutputFormat</code>: there is no duplicate data insertion. When one uses
-map/reduce to output data to Accumulo, restarted jobs may re-enter
-data from previous failed attempts. Generally, this only matters when
-there are aggregators. With bulk ingest, reducers write to new
-map files, so it does not matter: if a reducer fails, you create a new
-map file. When all reducers finish, you bulk ingest the map files
-into Accumulo. The disadvantage of bulk ingest compared to <code>AccumuloOutputFormat</code> is
-greater latency: the entire map/reduce job must complete
-before any data is available.
-
-</body>
-</html>
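For reference, a sketch of the file-creation job described above, pairing
RangePartitioner with AccumuloFileOutputFormat. The mapper and reducer, the
tab-separated input format, column names, paths, and reducer count are all
illustrative; the project's complete example is the README.bulkIngest linked
above.

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CreateBulkFiles {

  // Illustrative mapper: reads tab-separated "row<TAB>data" lines and emits
  // them keyed by row, so the partitioner can route each row to its tablet.
  public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2)
        ctx.write(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // Illustrative reducer: emits Key/Value pairs in sorted order. A counter in
  // the column qualifier keeps keys within a row strictly increasing.
  public static class SortReducer extends Reducer<Text, Text, Key, Value> {
    @Override
    protected void reduce(Text row, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      int i = 0;
      for (Text v : values) {
        Key key = new Key(row, new Text("cf"), new Text(String.format("%08d", i++)));
        ctx.write(key, new Value(v.copyBytes()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "create bulk ingest files");
    job.setJarByClass(CreateBulkFiles.class);
    job.setMapperClass(ParseMapper.class);
    job.setReducerClass(SortReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    FileInputFormat.addInputPath(job, new Path("/tmp/bulk/input"));

    // Partition map output by the table's split points so each reducer
    // produces one file covering a single tablet's range.
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, "/tmp/bulk/splits.txt");
    job.setNumReduceTasks(101); // number of split points + 1

    // Emit sorted Accumulo files that importDirectory() can bring online.
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}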