Author: elserj
Date: Fri Mar 22 19:13:57 2019
New Revision: 1856074
URL: http://svn.apache.org/viewvc?rev=1856074&view=rev
Log:
PHOENIX-5205 Clarify limitation around incremental loads via CSVBulkLoad w/
mutable 2ndary index
Modified:
phoenix/site/publish/language/datatypes.html
phoenix/site/publish/language/functions.html
phoenix/site/publish/language/index.html
phoenix/site/publish/secondary_indexing.html
phoenix/site/source/src/site/markdown/secondary_indexing.md
Modified: phoenix/site/publish/language/datatypes.html
URL:
http://svn.apache.org/viewvc/phoenix/site/publish/language/datatypes.html?rev=1856074&r1=1856073&r2=1856074&view=diff
==============================================================================
--- phoenix/site/publish/language/datatypes.html (original)
+++ phoenix/site/publish/language/datatypes.html Fri Mar 22 19:13:57 2019
@@ -1,7 +1,7 @@
<!DOCTYPE html>
<!--
- Generated by Apache Maven Doxia at 2019-02-28
+ Generated by Apache Maven Doxia at 2019-03-22
Rendered using Reflow Maven Skin 1.1.0
(http://andriusvelykis.github.io/reflow-maven-skin)
-->
<html xml:lang="en" lang="en">
Modified: phoenix/site/publish/language/functions.html
URL:
http://svn.apache.org/viewvc/phoenix/site/publish/language/functions.html?rev=1856074&r1=1856073&r2=1856074&view=diff
==============================================================================
--- phoenix/site/publish/language/functions.html (original)
+++ phoenix/site/publish/language/functions.html Fri Mar 22 19:13:57 2019
@@ -1,7 +1,7 @@
<!DOCTYPE html>
<!--
- Generated by Apache Maven Doxia at 2019-02-28
+ Generated by Apache Maven Doxia at 2019-03-22
Rendered using Reflow Maven Skin 1.1.0
(http://andriusvelykis.github.io/reflow-maven-skin)
-->
<html xml:lang="en" lang="en">
Modified: phoenix/site/publish/language/index.html
URL:
http://svn.apache.org/viewvc/phoenix/site/publish/language/index.html?rev=1856074&r1=1856073&r2=1856074&view=diff
==============================================================================
--- phoenix/site/publish/language/index.html (original)
+++ phoenix/site/publish/language/index.html Fri Mar 22 19:13:57 2019
@@ -1,7 +1,7 @@
<!DOCTYPE html>
<!--
- Generated by Apache Maven Doxia at 2019-02-28
+ Generated by Apache Maven Doxia at 2019-03-22
Rendered using Reflow Maven Skin 1.1.0
(http://andriusvelykis.github.io/reflow-maven-skin)
-->
<html xml:lang="en" lang="en">
Modified: phoenix/site/publish/secondary_indexing.html
URL:
http://svn.apache.org/viewvc/phoenix/site/publish/secondary_indexing.html?rev=1856074&r1=1856073&r2=1856074&view=diff
==============================================================================
--- phoenix/site/publish/secondary_indexing.html (original)
+++ phoenix/site/publish/secondary_indexing.html Fri Mar 22 19:13:57 2019
@@ -1,7 +1,7 @@
<!DOCTYPE html>
<!--
- Generated by Apache Maven Doxia at 2018-06-10
+ Generated by Apache Maven Doxia at 2019-03-22
Rendered using Reflow Maven Skin 1.1.0
(http://andriusvelykis.github.io/reflow-maven-skin)
-->
<html xml:lang="en" lang="en">
@@ -338,6 +338,12 @@ CREATE LOCAL INDEX my_index ON my_table
<li><tt>phoenix.index.failure.handling.rebuild</tt> must be set to false
to disable a mutable index from being rebuilt in the background in the event of
a commit failure.</li>
</ul>
</div>
+ <div class="section">
+ <h4 id="BulkLoad_Tool_Limitation">BulkLoad Tool Limitation</h4>
+ <p>The BulkLoadTools (e.g. CSVBulkLoadTool and JSONBulkLoadTool) cannot
presently generate correct updates to mutable secondary indexes when
pre-existing records are being updated. In the normal mutable secondary index
write path, we can safely calculate a Delete (for the old record) and a Put
(for the new record) for each secondary index while holding a row-lock to
prevent concurrent updates. In the context of a MapReduce job, we cannot
effectively execute this same logic because we are specifically doing this
“out of band” from the HBase RegionServers. As such, while these Tools
generate HFiles for the index tables with the proper updates for the data being
loaded, any previous index records corresponding to the same record in the
table are not deleted. The net effect of this limitation is: if you use these
Tools to re-ingest the same records to an index table, that index table will
have duplicate records in it which will result in incorrect query results from
that index table.</p>
+ <p>To perform incremental loads of data using the BulkLoadTools which may
update existing records, you must drop and re-create all index tables after the
data table is loaded. Re-creating the index with the <tt>ASYNC</tt> option and
using <tt>IndexTool</tt> to populate and enable that index is likely a must for
tables of non-trivial size.</p>
+ <p>To perform incremental loading of CSV datasets that do not require any
manual index intervention, the <tt>psql</tt> tool can be used in place of the
BulkLoadTools. Additionally, a MapReduce job could be written to parse CSV/JSON
data and write it directly to Phoenix; although, such a tool is not currently
provided by Phoenix for users.</p>
+ </div>
</div>
</div>
<div class="section">
@@ -385,7 +391,7 @@ CREATE LOCAL INDEX my_index ON my_table
<div class="section">
<h3 id="Upgrading_Local_Indexes_created_before_4.8.0">Upgrading Local
Indexes created before 4.8.0</h3>
<p>While upgrading the Phoenix to 4.8.0+ version at server remove above
three local indexing related configurations from <tt>hbase-site.xml</tt> if
present. From client we are supporting both online(while initializing the
connection from phoenix client of 4.8.0+ versions) and offline(using psql tool)
upgrade of local indexes created before 4.8.0. As part of upgrade we recreate
the local indexes in ASYNC mode. After upgrade user need to build the indexes
using <a class="externalLink"
href="http://phoenix.apache.org/secondary_indexing.html#Index_Population">IndexTool</a></p>
- <p>Following client side configuration used in the upgrade. </p>
+ <p>Following client side configuration used in the upgrade.</p>
<ol style="list-style-type: decimal">
<li><tt>phoenix.client.localIndexUpgrade</tt>
<ul>
@@ -802,7 +808,7 @@ CREATE LOCAL INDEX my_index ON my_table
<div class="row">
<div class="span12">
<p class="pull-right"><a href="#">Back to
top</a></p>
- <p class="copyright">Copyright ©2018 <a
href="http://www.apache.org">Apache Software Foundation</a>. All Rights
Reserved.</p>
+ <p class="copyright">Copyright ©2019 <a
href="http://www.apache.org">Apache Software Foundation</a>. All Rights
Reserved.</p>
</div>
</div>
</div>
Modified: phoenix/site/source/src/site/markdown/secondary_indexing.md
URL:
http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/secondary_indexing.md?rev=1856074&r1=1856073&r2=1856074&view=diff
==============================================================================
--- phoenix/site/source/src/site/markdown/secondary_indexing.md (original)
+++ phoenix/site/source/src/site/markdown/secondary_indexing.md Fri Mar 22 19:13:57 2019
@@ -202,6 +202,26 @@ The following server-side configurations
* <code>phoenix.index.failure.handling.rebuild</code> must be set to false to
disable a mutable index from being
rebuilt in the background in the event of a commit failure.
+#### BulkLoad Tool Limitation
+
+The BulkLoadTools (e.g. CSVBulkLoadTool and JSONBulkLoadTool) cannot presently
generate correct updates to mutable
+secondary indexes when pre-existing records are being updated. In the normal
mutable secondary index write path, we can
+safely calculate a Delete (for the old record) and a Put (for the new record)
for each secondary index while holding a
+row-lock to prevent concurrent updates. In the context of a MapReduce job, we
cannot effectively execute this same logic
+because we are specifically doing this "out of band" from the HBase
RegionServers. As such, while these Tools generate
+HFiles for the index tables with the proper updates for the data being loaded,
any previous index records corresponding
+to the same record in the table are not deleted. The net effect of this
limitation is: if you use these Tools to re-ingest
+the same records to an index table, that index table will have duplicate
records in it which will result in incorrect
+query results from that index table.
+
+To perform incremental loads of data using the BulkLoadTools which may update
existing records, you must
+drop and re-create all index tables after the data table is loaded.
Re-creating the index with the `ASYNC` option and
+using `IndexTool` to populate and enable that index is likely a must for
tables of non-trivial size.
+
+To perform incremental loading of CSV datasets that do not require any manual
index intervention, the `psql` tool can
+be used in place of the BulkLoadTools. Additionally, a MapReduce job could be
written to parse CSV/JSON data and write
+it directly to Phoenix; although, such a tool is not currently provided by
Phoenix for users.
+
## Setup
Non transactional, mutable indexing requires special configuration options on
the region server and master to run - Phoenix ensures that they are setup
correctly when you enable mutable indexing on the table; if the correct
properties are not set, you will not be able to use secondary indexing. After
adding these settings to your hbase-site.xml, you'll need to do a rolling
restart of your cluster.
@@ -251,9 +271,9 @@ From Phoenix 4.8.0 onward, no configurat
### Upgrading Local Indexes created before 4.8.0
While upgrading the Phoenix to 4.8.0+ version at server remove above three
local indexing related configurations from `hbase-site.xml` if present. From
client we are supporting both online(while initializing the connection from
phoenix client of 4.8.0+ versions) and offline(using psql tool) upgrade of
local indexes created before 4.8.0. As part of upgrade we recreate the local
indexes in ASYNC mode. After upgrade user need to build the indexes using
[IndexTool](http://phoenix.apache.org/secondary_indexing.html#Index_Population)
-Following client side configuration used in the upgrade.
-
-1. <code>phoenix.client.localIndexUpgrade</code>
+Following client side configuration used in the upgrade.
+
+1. <code>phoenix.client.localIndexUpgrade</code>
* The value of it is true means online upgrade and false means offline
upgrade.
* **Default: true**
@@ -288,7 +308,7 @@ All the following parameters must be set
6. hbase.htable.threads.keepalivetime
* Amount of time in seconds after we expire threads in the HTable's thread
pool.
* Using the "direct handoff" approach, new threads will only be created if
it is necessary and will grow unbounded. This could be bad but HTables only
create as many Runnables as there are region servers; therefore, it also scales
when new region servers are added.
- * **Default: 60**
+ * **Default: 60**
7. index.tablefactory.cache.size
* Number of index HTables we should keep in cache.
* Increasing this number ensures that we do not need to recreate an HTable
for each attempt to write to an index table. Conversely, you could see memory
pressure if this value is set too high.
@@ -323,7 +343,7 @@ It can also be run from Hadoop using eit
HADOOP_CLASSPATH=$(hbase mapredcp) hadoop jar phoenix-<version>-server.jar
org.apache.phoenix.mapreduce.index.IndexScrutinyTool -dt my_table -it my_index
-o
By default two mapreduce jobs are launched, one with the data table as the
source table and one with the index table as the source table.
-
+
The following parameters can be used with the Index Scrutiny Tool:
| *Parameter* | *Description* |
@@ -345,7 +365,7 @@ The following parameters can be used wit
## Resources
There have been several presentations given on how secondary indexing works in
Phoenix that have a more in-depth look at how indexing works (with pretty
pictures!):
-
+
* [San Francisco HBase
Meetup](http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx) -
Sept. 26, 2013
* [Los Angeles HBase
Meetup](http://www.slideshare.net/jesse_yates/phoenix-secondary-indexing-la-hug-sept-9th-2013)
- Sept, 4th, 2013
* [Local
Indexes](https://github.com/Huawei-Hadoop/hindex/blob/master/README.md#how-it-works)
by Huawei