Author: elserj Date: Fri Mar 22 19:13:57 2019 New Revision: 1856074 URL: http://svn.apache.org/viewvc?rev=1856074&view=rev Log: PHOENIX-5205 Clarify limitation around incremental loads via CSVBulkLoad w/ mutable 2ndary index
Modified: phoenix/site/publish/language/datatypes.html phoenix/site/publish/language/functions.html phoenix/site/publish/language/index.html phoenix/site/publish/secondary_indexing.html phoenix/site/source/src/site/markdown/secondary_indexing.md Modified: phoenix/site/publish/language/datatypes.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/language/datatypes.html?rev=1856074&r1=1856073&r2=1856074&view=diff ============================================================================== --- phoenix/site/publish/language/datatypes.html (original) +++ phoenix/site/publish/language/datatypes.html Fri Mar 22 19:13:57 2019 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 2019-02-28 + Generated by Apache Maven Doxia at 2019-03-22 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> Modified: phoenix/site/publish/language/functions.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/language/functions.html?rev=1856074&r1=1856073&r2=1856074&view=diff ============================================================================== --- phoenix/site/publish/language/functions.html (original) +++ phoenix/site/publish/language/functions.html Fri Mar 22 19:13:57 2019 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 2019-02-28 + Generated by Apache Maven Doxia at 2019-03-22 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> Modified: phoenix/site/publish/language/index.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/language/index.html?rev=1856074&r1=1856073&r2=1856074&view=diff ============================================================================== --- phoenix/site/publish/language/index.html (original) +++ phoenix/site/publish/language/index.html Fri Mar 22 19:13:57 2019 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 
2019-02-28 + Generated by Apache Maven Doxia at 2019-03-22 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> Modified: phoenix/site/publish/secondary_indexing.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/secondary_indexing.html?rev=1856074&r1=1856073&r2=1856074&view=diff ============================================================================== --- phoenix/site/publish/secondary_indexing.html (original) +++ phoenix/site/publish/secondary_indexing.html Fri Mar 22 19:13:57 2019 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 2018-06-10 + Generated by Apache Maven Doxia at 2019-03-22 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> @@ -338,6 +338,12 @@ CREATE LOCAL INDEX my_index ON my_table <li><tt>phoenix.index.failure.handling.rebuild</tt> must be set to false to disable a mutable index from being rebuilt in the background in the event of a commit failure.</li> </ul> </div> + <div class="section"> + <h4 id="BulkLoad_Tool_Limitation">BulkLoad Tool Limitation</h4> + <p>The BulkLoadTools (e.g. CSVBulkLoadTool and JSONBulkLoadTool) cannot presently generate correct updates to mutable secondary indexes when pre-existing records are being updated. In the normal mutable secondary index write path, we can safely calculate a Delete (for the old record) and a Put (for the new record) for each secondary index while holding a row-lock to prevent concurrent updates. In the context of a MapReduce job, we cannot effectively execute this same logic because we are specifically doing this “out of band” from the HBase RegionServers. As such, while these Tools generate HFiles for the index tables with the proper updates for the data being loaded, any previous index records corresponding to the same record in the table are not deleted. 
This net-effect of this limitation is: if you use these Tools to re-ingest the same records to an index table, that index table will have duplicate records in it which will result in incorrect query results from that index table.</p> + <p>To perform incremental loads of data using the BulkLoadTools which may update existing records, you must drop and re-create all index tables after the data table is loaded. Re-creating the index with the <tt>ASYNC</tt> option and using <tt>IndexTool</tt> to populate and enable that index is likely a must for tables of non-trivial size.</p> + <p>To perform incremental loading of CSV datasets that do not require any manual index intervention, the <tt>psql</tt> tool can be used in place of the BulkLoadTools. Additionally, a MapReduce job could be written to parse CSV/JSON data and write it directly to Phoenix; although, such a tool is not currently provided by Phoenix for users.</p> + </div> </div> </div> <div class="section"> @@ -385,7 +391,7 @@ CREATE LOCAL INDEX my_index ON my_table <div class="section"> <h3 id="Upgrading_Local_Indexes_created_before_4.8.0">Upgrading Local Indexes created before 4.8.0</h3> <p>While upgrading the Phoenix to 4.8.0+ version at server remove above three local indexing related configurations from <tt>hbase-site.xml</tt> if present. From client we are supporting both online(while initializing the connection from phoenix client of 4.8.0+ versions) and offline(using psql tool) upgrade of local indexes created before 4.8.0. As part of upgrade we recreate the local indexes in ASYNC mode. After upgrade user need to build the indexes using <a class="externalLink" href="http://phoenix.apache.org/secondary_indexing.html#Index_Population">IndexTool</a></p> - <p>Following client side configuration used in the upgrade. 
</p> + <p>Following client side configuration used in the upgrade.</p> <ol style="list-style-type: decimal"> <li><tt>phoenix.client.localIndexUpgrade</tt> <ul> @@ -802,7 +808,7 @@ CREATE LOCAL INDEX my_index ON my_table <div class="row"> <div class="span12"> <p class="pull-right"><a href="#">Back to top</a></p> - <p class="copyright">Copyright ©2018 <a href="http://www.apache.org">Apache Software Foundation</a>. All Rights Reserved.</p> + <p class="copyright">Copyright ©2019 <a href="http://www.apache.org">Apache Software Foundation</a>. All Rights Reserved.</p> </div> </div> </div> Modified: phoenix/site/source/src/site/markdown/secondary_indexing.md URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/secondary_indexing.md?rev=1856074&r1=1856073&r2=1856074&view=diff ============================================================================== --- phoenix/site/source/src/site/markdown/secondary_indexing.md (original) +++ phoenix/site/source/src/site/markdown/secondary_indexing.md Fri Mar 22 19:13:57 2019 @@ -202,6 +202,26 @@ The following server-side configurations * <code>phoenix.index.failure.handling.rebuild</code> must be set to false to disable a mutable index from being rebuilt in the background in the event of a commit failure. +#### BulkLoad Tool Limitation + +The BulkLoadTools (e.g. CSVBulkLoadTool and JSONBulkLoadTool) cannot presently generate correct updates to mutable +secondary indexes when pre-existing records are being updated. In the normal mutable secondary index write path, we can +safely calculate a Delete (for the old record) and a Put (for the new record) for each secondary index while holding a +row-lock to prevent concurrent updates. In the context of a MapReduce job, we cannot effectively execute this same logic +because we are specifically doing this "out of band" from the HBase RegionServers. 
As such, while these Tools generate +HFiles for the index tables with the proper updates for the data being loaded, any previous index records corresponding +to the same record in the table are not deleted. This net-effect of this limitation is: if you use these Tools to re-ingest +the same records to an index table, that index table will have duplicate records in it which will result in incorrect +query results from that index table. + +To perform incremental loads of data using the BulkLoadTools which may update existing records, you must +drop and re-create all index tables after the data table is loaded. Re-creating the index with the `ASYNC` option and +using `IndexTool` to populate and enable that index is likely a must for tables of non-trivial size. + +To perform incremental loading of CSV datasets that do not require any manual index intervention, the `psql` tool can +be used in place of the BulkLoadTools. Additionally, a MapReduce job could be written to parse CSV/JSON data and write +it directly to Phoenix; although, such a tool is not currently provided by Phoenix for users. + ## Setup Non transactional, mutable indexing requires special configuration options on the region server and master to run - Phoenix ensures that they are setup correctly when you enable mutable indexing on the table; if the correct properties are not set, you will not be able to use secondary indexing. After adding these settings to your hbase-site.xml, you'll need to do a rolling restart of your cluster. @@ -251,9 +271,9 @@ From Phoenix 4.8.0 onward, no configurat ### Upgrading Local Indexes created before 4.8.0 While upgrading the Phoenix to 4.8.0+ version at server remove above three local indexing related configurations from `hbase-site.xml` if present. From client we are supporting both online(while initializing the connection from phoenix client of 4.8.0+ versions) and offline(using psql tool) upgrade of local indexes created before 4.8.0. 
As part of upgrade we recreate the local indexes in ASYNC mode. After upgrade user need to build the indexes using [IndexTool](http://phoenix.apache.org/secondary_indexing.html#Index_Population) -Following client side configuration used in the upgrade. - -1. <code>phoenix.client.localIndexUpgrade</code> +Following client side configuration used in the upgrade. + +1. <code>phoenix.client.localIndexUpgrade</code> * The value of it is true means online upgrade and false means offline upgrade. * **Default: true** @@ -288,7 +308,7 @@ All the following parameters must be set 6. hbase.htable.threads.keepalivetime * Amount of time in seconds after we expire threads in the HTable's thread pool. * Using the "direct handoff" approach, new threads will only be created if it is necessary and will grow unbounded. This could be bad but HTables only create as many Runnables as there are region servers; therefore, it also scales when new region servers are added. - * **Default: 60** + * **Default: 60** 7. index.tablefactory.cache.size * Number of index HTables we should keep in cache. * Increasing this number ensures that we do not need to recreate an HTable for each attempt to write to an index table. Conversely, you could see memory pressure if this value is set too high. @@ -323,7 +343,7 @@ It can also be run from Hadoop using eit HADOOP_CLASSPATH=$(hbase mapredcp) hadoop jar phoenix-<version>-server.jar org.apache.phoenix.mapreduce.index.IndexScrutinyTool -dt my_table -it my_index -o By default two mapreduce jobs are launched, one with the data table as the source table and one with the index table as the source table. 
- + The following parameters can be used with the Index Scrutiny Tool: | *Parameter* | *Description* | @@ -345,7 +365,7 @@ The following parameters can be used wit ## Resources There have been several presentations given on how secondary indexing works in Phoenix that have a more in-depth look at how indexing works (with pretty pictures!): - + * [San Francisco HBase Meetup](http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx) - Sept. 26, 2013 * [Los Anglees HBase Meetup](http://www.slideshare.net/jesse_yates/phoenix-secondary-indexing-la-hug-sept-9th-2013) - Sept, 4th, 2013 * [Local Indexes](https://github.com/Huawei-Hadoop/hindex/blob/master/README.md#how-it-works) by Huawei
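
Note on the BulkLoad Tool Limitation text added in this revision: the toy Python model below is a sketch, not Phoenix code, of why re-ingesting existing rows through a bulk load can leave stale index entries while the normal write path does not. The names `online_upsert`, `bulk_load`, `stale_entries`, and `rebuild_index` are hypothetical, and the data table and index are modeled as a plain dict and set rather than HBase tables.

```python
# Toy model (assumption): the data table maps row_key -> indexed value,
# and the secondary index is a set of (indexed value, row_key) entries.


def online_upsert(table, index, row_key, value):
    """Normal write path: Delete the old index entry, then Put the new one."""
    old = table.get(row_key)
    if old is not None:
        index.discard((old, row_key))  # Delete for the old record
    table[row_key] = value
    index.add((value, row_key))  # Put for the new record


def bulk_load(table, index, rows):
    """Bulk-load path: only Puts are produced; old index entries are kept."""
    for row_key, value in rows:
        table[row_key] = value
        index.add((value, row_key))


def stale_entries(table, index):
    """Index entries that no longer match the current value in the table."""
    return {(value, key) for (value, key) in index if table.get(key) != value}


def rebuild_index(table):
    """Stand-in for dropping and re-creating the index from the data table."""
    return {(value, key) for key, value in table.items()}
```

Loading `[("r1", "a")]` and then `[("r1", "c")]` through `bulk_load` leaves both `("a", "r1")` and `("c", "r1")` in the index, which corresponds to the duplicate records the new section warns about. Running the same two writes through `online_upsert`, or calling `rebuild_index` after the load, leaves only the current entry.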