Re: secondary index feature
Thanks James! I have some Phoenix-specific questions. I suppose the Phoenix group is a better place to discuss those, though.

Henning

On 03.01.2014 22:34, James Taylor wrote:
No worries, Henning. It's a little deceiving, because the coprocessors that do the index maintenance are invoked on a per-region basis. However, the writes/puts that they do for the maintenance end up going over the wire if necessary. Let me know if you have other questions. It'd be good to understand your use case more to see if Phoenix is a good fit - we're definitely open to collaborating. FYI, we're in the process of moving to Apache, so will keep you posted once the transition is complete. Thanks, James

On Fri, Jan 3, 2014 at 1:11 PM, Henning Blohm henning.bl...@zfabrik.de wrote:
Hi James, this is a little embarrassing... I even browsed through the code and read it as implementing a region-level index. But now at least I get the restrictions mentioned for using the covered indexes. Thanks for clarifying. Guess I need to browse the code a little harder ;-) Henning

On 03.01.2014 21:53, James Taylor wrote:
Hi Henning, Phoenix maintains a global index. It is essentially maintaining another HBase table for you, with a different row key (and a subset of your data table columns that are covered). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does - it ends up issuing a Phoenix query against a Phoenix table that happens to be an index table). The hit you take for a global index is at write time - we need to look up the prior state of the rows being updated to do the index maintenance. Then we need to do a write to the index table. The upside is that there's no hit at read/query time (we don't yet attempt to join from the index table back to the data table - if a query uses columns that aren't in the index, the index simply won't be used).
More here: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing

Thanks,
James

On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm henning.bl...@zfabrik.de wrote:
When scanning in order of an index and you use RLI, it seems there is no alternative but to involve all regions - and essentially this should happen in parallel, as otherwise you might not get what you wanted. Also, for a single Get, it seems (as Lars pointed out in https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult all regions. When that parallelism is no problem (small number of servers), it will actually help single-scan performance (regions can provide their share in parallel). But a high number of concurrent client requests leads to the same number of requests on all regions, and a multiple of connections to be maintained by the client. My assumption is that this will eventually lead to a scalability problem - when having, say, 100 region servers or so in place. I was wondering if anyone has experience with that. This will be perfectly acceptable for many use cases that benefit from the scan (and hence query) performance more than they suffer from the load problem. Other use cases have fewer requirements on scans and query flexibility, but rather want to preserve the quality that a Get has fixed resource usage. Btw.: I was convinced that Phoenix is keeping indexes on the region level. Is that not so?

Thanks,
Henning

On 03.01.2014 17:57, Anoop John wrote:
In the case of a normal HBase scan, as we know, regions will be scanned sequentially. Phoenix has parallel scan impls in it. When RLI is used and we make use of the index completely on the server side, it is irrespective of how the client scans. Sequential or parallel, using Java or any other client layer, using an SQL layer like Phoenix, using MR or not - the client side doesn't have to worry about this; the index usage will be fully at the server end. Yes, when a parallel scan is done on regions, RLI might perform much better.
-Anoop-

On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla rajeshbabu.chintagun...@huawei.com wrote:
No, the regions are scanned sequentially.

From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Friday, January 03, 2014 7:26 PM
To: user@hbase.apache.org
Subject: Re: secondary index feature

Are the regions scanned in parallel?

On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
Here are some performance numbers with RLI.

Number of region servers: 4
Data per region: 2 GB

Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (s)
        50 |           200 |              64 |                  199 |           102
        50 |           200 |               8 |                  199 |            35
       100 |           400 |               8 |                  350 |            95
       200 |           800 |               8 |                  353 |           153

Without a secondary index, the scan takes hours.

Thanks,
Rajeshbabu

From: Anoop John [anoop.hb...@gmail.com]
Sent: Friday, January 03, 2014 3:22 PM
To: user@hbase.apache.org
Subject: Re: secondary index feature

Is there any data on how RLI (or in particular Phoenix) query throughput correlates
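James's description of global index maintenance above (look up the prior state of the row being updated, remove the stale index entry, write the new one, with the index keyed on the indexed value) can be sketched with plain sorted maps standing in for HBase tables. This is a hedged illustration only, not the Phoenix implementation; all class, method, and key names are made up:

```java
import java.util.TreeMap;

// Illustrative sketch of write-time global-index maintenance. TreeMaps
// stand in for the data table and the index table; the "|" separator in
// index keys is an arbitrary choice for this example.
public class GlobalIndexSketch {
    // data table: entityKey -> indexedValue
    static final TreeMap<String, String> dataTable = new TreeMap<>();
    // index table: "indexedValue|entityKey" -> "" (key-only entries)
    static final TreeMap<String, String> indexTable = new TreeMap<>();

    static void put(String entityKey, String newValue) {
        String oldValue = dataTable.get(entityKey);          // look up prior state
        if (oldValue != null) {
            indexTable.remove(oldValue + "|" + entityKey);   // delete stale index entry
        }
        indexTable.put(newValue + "|" + entityKey, "");      // write new index entry
        dataTable.put(entityKey, newValue);                  // then write the data row
    }

    public static void main(String[] args) {
        put("row1", "apple");
        put("row1", "banana"); // update: the old index entry must disappear
        System.out.println(indexTable.keySet()); // prints [banana|row1]
    }
}
```

The write-time cost James mentions is visible here: every update pays one read (prior state) plus one extra write (index), while reads via the index pay nothing extra.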
Re: secondary index feature
Jesse, James, Lars,

after looking around a bit, and in particular looking into Phoenix (which I find very interesting): assuming that you want secondary indexing on HBase without adding other infrastructure, there does not seem to be a lot of choice, really. Either go with a region-level (and coprocessor-based) indexing feature (Phoenix, Huawei - is IHBase dead?), or add an index table to store (index value, entity key) pairs.

The main concern I have with region-level indexing (RLI) is that Gets potentially require visiting all regions. Compared to global index tables, this seems to flatten the read-scalability curve of the cluster. In our case, we have a large data set (hence HBase) that will be queried (mostly point-gets via an index) in some linear correlation with its size. Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers, assuming homogeneously distributed data?

Thanks,
Henning

On 24.12.2013 12:18, Henning Blohm wrote:
All that sounds very promising. I will give it a try and let you know how things worked out. Thanks, Henning

On 12/23/2013 08:10 PM, Jesse Yates wrote:
The work that James is referencing grew out of the discussions Lars and I had (which led to those blog posts). The solution we implemented is designed to be generic, as James mentioned above, but was written with all the hooks necessary for Phoenix to do some really fast updates (or skip updates in the case where there is no change). You should be able to plug your own simple index builder (there is an example in the Phoenix codebase: https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example) into a basic solution which supports the same transactional guarantees as HBase (per row) + data guarantees across the index rows. There are more details in the presentations James linked.
I'd love to see if your implementation can fit into the framework we wrote - we would be happy to work with you to see if it needs some more hooks or modifications. I have a feeling this is pretty much what you guys will need.

-Jesse

On Mon, Dec 23, 2013 at 10:01 AM, James Taylor jtay...@salesforce.com wrote:
Henning, Jesse Yates wrote the back-end of our global secondary indexing system in Phoenix. He designed it as a separate, pluggable module with no Phoenix dependencies. Here's an overview of the feature: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section that discusses the data guarantees and failure management might be of interest to you: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management This presentation also gives a good overview of the pluggability of his implementation: http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx

Thanks,
James

On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm henning.bl...@zfabrik.de wrote:
Lars, that is exactly why I am hesitant to use one of the core-level generic approaches (apart from having difficulties identifying the still-active projects): I have doubts that I can sufficiently explain to myself when and where they fail.

By "toolbox approach" I meant that turning entity data into index data is not done generically, but rather involves domain-specific application code that
- indicates what makes an index key, given an entity
- indicates whether an index entry is still valid, given an entity

That code is also used during index rebuild and trimming (an M/R job). So validating whether an index entry is valid means loading the entity pointed to and - before considering it a valid result - checking whether the values of the entity still match the index. The entity is written last; hence, when the client dies halfway through the update, you may get stale index entries, but nothing else should break.
For scanning along the index, we are using a chunk iterator; that is, we read n index entries ahead and then do point lookups for the entities. How would you avoid point-gets when scanning via an index (as most likely, entities are ordered independently of the index - hence the index)?

Something really important to note is that there is no intention to build a completely generic solution, in particular not (this time - unlike the other post of mine you responded to) taking row versioning into account. Instead, row timestamps are used to delete stale entries (old entries after an index rebuild).

Thanks a lot for your blog pointers. I haven't had time to study them in depth, but at first glance there is a lot of overlap between what you are proposing and what I ended up doing after the first post. On the second post: indeed, I have not worried too much about transactional isolation of updates. If the index update and the entity update use the same HBase timestamp, the result should at least be consistent, right?

Btw., in no way am I claiming originality of my thoughts - in particular I
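The chunk-iterator-plus-validation scheme Henning describes (read n index entries ahead, resolve them with point lookups, keep only entries the entity still confirms, since stale index entries can survive a client crash when the entity is written last) can be sketched as follows. TreeMaps again stand in for HBase tables; the names and key encoding are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of scanning via an index table with a chunk
// iterator and read-time validation of possibly-stale index entries.
public class ChunkIteratorSketch {
    static final TreeMap<String, String> dataTable = new TreeMap<>();  // entityKey -> value
    static final TreeMap<String, String> indexTable = new TreeMap<>(); // "value|entityKey" -> ""

    static List<String> scanViaIndex(int chunkSize) {
        List<String> entities = new ArrayList<>();
        Iterator<String> it = indexTable.keySet().iterator();
        while (it.hasNext()) {
            List<String> chunk = new ArrayList<>(chunkSize);
            while (it.hasNext() && chunk.size() < chunkSize) {
                chunk.add(it.next());                         // read n index entries ahead
            }
            for (String indexKey : chunk) {
                String[] parts = indexKey.split("\\|", 2);
                String value = parts[0], entityKey = parts[1];
                String entity = dataTable.get(entityKey);     // point lookup for the entity
                if (value.equals(entity)) {                   // entity must still match the
                    entities.add(entityKey);                  // index, else the entry is stale
                }
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        dataTable.put("e1", "a"); indexTable.put("a|e1", "");
        dataTable.put("e2", "b"); indexTable.put("b|e2", "");
        indexTable.put("zzz|e1", ""); // stale entry: e1 no longer has value "zzz"
        System.out.println(scanViaIndex(2)); // prints [e1, e2]
    }
}
```

The point-gets per chunk are the unavoidable cost Henning mentions: because entities are ordered independently of the index, each validated result needs one lookup on the data table.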
Re: secondary index feature
Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers assuming homogeneously distributed data?

Phoenix has yet to add RLI; for now it has global indexing only. Correct, James? The RLI impl from Huawei (HIndex) has some numbers wrt regions, but I doubt whether they cover a large number of RSs. Do you have some data, Rajesh Babu?

-Anoop-
RE: secondary index feature
Here are some performance numbers with RLI.

Number of region servers: 4
Data per region: 2 GB

Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (s)
        50 |           200 |              64 |                  199 |           102
        50 |           200 |               8 |                  199 |            35
       100 |           400 |               8 |                  350 |            95
       200 |           800 |               8 |                  353 |           153

Without a secondary index, the scan takes hours.

Thanks,
Rajeshbabu
Re: secondary index feature
A proportional difference in time taken, wrt an increase in the number of RSs (keeping the number of rows matching the values constant), would be of utmost interest.

-Anoop-
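The scalability concern behind this exchange - how per-query cost behaves as region servers are added - can be made concrete with back-of-the-envelope arithmetic. This is a purely illustrative model under stated assumptions (one probe per region for RLI, one index lookup plus one data-table Get for a global index), not a benchmark:

```java
// Toy model of read fan-out: region-level indexing (RLI) must consult
// every region for a point lookup, so server-side probes grow with the
// region count, while a global index table keeps them constant.
public class FanoutModel {
    // RLI: one index probe per region per query (assumed)
    static long rliProbes(long regions, long queries) {
        return regions * queries;
    }

    // Global index table: one index lookup + one data-table Get per query (assumed)
    static long globalProbes(long queries) {
        return 2 * queries;
    }

    public static void main(String[] args) {
        for (long regions : new long[]{10, 100, 1000}) {
            System.out.println(regions + " regions: RLI=" + rliProbes(regions, 1)
                    + " probes/query, global=" + globalProbes(1) + " probes/query");
        }
    }
}
```

Under these assumptions, growing the cluster from 10 to 1000 regions multiplies RLI's per-query work by 100 while the global index's stays flat - which is exactly the "flattened read-scalability curve" Henning worries about, and why measurements holding the matching-row count constant while varying RS count would be so informative.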
Re: secondary index feature
What is generally of interest - RLI or global-level indexing? I know it depends on the use case, but is there a common need?
Re: secondary index feature
Are the regions scanned in parallel?
Re: secondary index feature
I think both approaches should be provided to HBase users. These are new features that would both find proper usage scenarios. Cheers

On Jan 3, 2014, at 5:48 AM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote:

What is generally of interest, RLI or the global level? I know it is based on the use case, but is there a common need?

On Fri, Jan 3, 2014 at 4:31 PM, Anoop John anoop.hb...@gmail.com wrote:

A proportional difference in time taken, with respect to an increase in the number of region servers (keeping the number of matching rows constant), would be what is of utmost interest. -Anoop-
RE: secondary index feature
No, the regions are scanned sequentially.

From: Asaf Mesika [asaf.mes...@gmail.com] Sent: Friday, January 03, 2014 7:26 PM To: user@hbase.apache.org Subject: Re: secondary index feature

Are the regions scanned in parallel?
Re: secondary index feature
When scanning in order of an index and you use RLI, it seems there is no alternative but to involve all regions, and essentially this should happen in parallel, as otherwise you might not get what you wanted. Also, for a single Get, it seems (as Lars pointed out in https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult all regions. When that parallelism is no problem (small number of servers), it will actually help single-scan performance (regions can provide their share in parallel). A high number of concurrent client requests leads to the same number of requests on all regions, and a multiple of connections to be maintained by the client. My assumption is that this will eventually lead to a scalability problem when, say, having 100 region servers or so in place. I was wondering if anyone has experience with that. That will be perfectly acceptable for many use cases that benefit from the scan (and hence query) performance more than they suffer from the load problem. Other use cases have fewer requirements on scans and query flexibility but rather want to preserve the quality that a Get has fixed resource usage. Btw.: I was convinced that Phoenix is keeping indexes on the region level. Is that not so? Thanks, Henning

On 03.01.2014 17:57, Anoop John wrote:

In the case of a normal HBase scan, as we know, regions will be scanned sequentially. Phoenix has parallel scan implementations in it. When RLI is used and we make use of the index completely at the server side, it is irrespective of the client's scan style. Sequential or parallel, using Java or any other client layer, using a SQL layer like Phoenix, using MR or not: the client side doesn't have to worry about this; the index usage is fully at the server end. Yes, when a parallel scan is done on regions, RLI might perform much better. -Anoop-

On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla rajeshbabu.chintagun...@huawei.com wrote:

No, the regions are scanned sequentially.
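Why an index-ordered scan under RLI must involve every region can be pictured as a k-way merge: each region holds a sorted slice of the index, and a globally ordered result requires merging all slices (ideally fetched in parallel). A toy Python sketch, not HBase API; data is invented:

```python
# Toy sketch: each region contributes a sorted (index_value, row_key)
# slice; a globally ordered scan is a k-way merge over all of them.
import heapq

def indexed_scan(per_region_indexes):
    """Merge each region's sorted index slice into one globally ordered
    stream. No region can be skipped, or ordering could be wrong."""
    return list(heapq.merge(*per_region_indexes))

# Index values interleave across regions, which is exactly why a scan in
# index order cannot ignore any of them.
r1 = [(10, "a"), (40, "d")]
r2 = [(20, "b"), (50, "e")]
r3 = [(30, "c")]
assert indexed_scan([r1, r2, r3]) == [
    (10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e")]
```

With a global index table there is no merge step: the index rows are already globally sorted by index value.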
Re: secondary index feature
Hi Henning, Phoenix maintains a global index. It is essentially maintaining another HBase table for you with a different row key (and a subset of your data table columns that are covered). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does: it ends up issuing a Phoenix query against a Phoenix table that happens to be an index table). The hit you take for a global index is at write time: we need to look up the prior state of the rows being updated to do the index maintenance, and then we need to do a write to the index table. The upside is that there's no hit at read/query time (we don't yet attempt to join from the index table back to the data table; if a query uses columns that aren't in the index, the index simply won't be used). More here: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing Thanks, James
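The write-time cost James describes for a global index (look up the prior row state, remove the stale index entry, write the new one) has roughly this shape. This is a hedged toy sketch over in-memory dicts, not the Phoenix implementation; it assumes one row per index value for simplicity.

```python
# Toy sketch of global-index maintenance on upsert: the extra read of the
# prior row state is what makes writes more expensive; reads stay cheap.

data_table = {}    # row_key -> indexed column value
index_table = {}   # indexed value -> row_key (covered columns would ride along)

def upsert(row_key, new_value):
    old_value = data_table.get(row_key)        # extra read: prior row state
    if old_value is not None:
        index_table.pop(old_value, None)       # delete the stale index row
    index_table[new_value] = row_key           # write the new index row
    data_table[row_key] = new_value            # write the data row

upsert("r1", "red")
upsert("r1", "blue")                           # update changes the index key
assert index_table == {"blue": "r1"}           # old "red" entry was removed
assert data_table == {"r1": "blue"}
```

A covered index additionally copies the covered data columns into the index row, which is what lets a query be served from the index table alone without joining back to the data table.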
Re: secondary index feature
Hi James, this is a little embarrassing... I even browsed through the code and read it as implementing a region-level index. But now at least I get the restrictions mentioned for using the covered indexes. Thanks for clarifying. Guess I need to browse the code a little harder ;-) Henning
Re: secondary index feature
All that sounds very promising. I will give it a try and let you know how things worked out. Thanks, Henning

On 12/23/2013 08:10 PM, Jesse Yates wrote:

The work that James is referencing grew out of the discussions Lars and I had (which led to those blog posts). The solution we implemented is designed to be generic, as James mentioned above, but was written with all the hooks necessary for Phoenix to do some really fast updates (or to skip updates in the case where there is no change). You should be able to plug your own simple index builder (there is an example in the Phoenix codebase: https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example) into the basic solution, which supports the same transactional guarantees as HBase (per row) plus data guarantees across the index rows. There are more details in the presentations James linked. I'd love to see if your implementation can fit into the framework we wrote; we would be happy to work with you to see if it needs some more hooks or modifications. I have a feeling this is pretty much what you guys will need. -Jesse

On Mon, Dec 23, 2013 at 10:01 AM, James Taylor jtay...@salesforce.com wrote:

Henning, Jesse Yates wrote the back end of our global secondary indexing system in Phoenix. He designed it as a separate, pluggable module with no Phoenix dependencies. Here's an overview of the feature: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section that discusses the data guarantees and failure management might be of interest to you: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management This presentation also gives a good overview of the pluggability of his implementation: http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx Thanks, James
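The pluggable index-builder hook Jesse describes, where framework code hands a row update plus the row's prior state to user code that turns it into index deletes and puts, can be pictured as follows. The interface and class names here are invented for illustration and are not taken from the Phoenix codebase.

```python
# Toy sketch of a pluggable index builder: user code decides what the
# index key for a row is; framework code diffs old vs new row state into
# index mutations. Names are hypothetical, not the Phoenix API.

class IndexBuilder:
    """User-supplied policy: what makes an index key, given a row."""
    def index_key(self, row):
        raise NotImplementedError

class ByColorBuilder(IndexBuilder):
    def index_key(self, row):
        return row["color"]

def index_updates(builder, old_row, new_row):
    """Framework side: produce (deletes, puts) for the index table."""
    deletes, puts = [], []
    if old_row is not None:
        deletes.append(builder.index_key(old_row))   # drop stale entry
    puts.append(builder.index_key(new_row))          # add current entry
    return deletes, puts

d, p = index_updates(ByColorBuilder(), {"color": "red"}, {"color": "blue"})
assert (d, p) == (["red"], ["blue"])
```

The appeal of this split is that the generic framework owns ordering and failure handling while the domain-specific part stays a small, testable function.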
Re: secondary index feature
Thanks for pointing to Lily. I read about it, but it seems to add significant additional infrastructure, which I am trying to avoid out of fear of adding unwanted complexity. That may be unjustified, and I may need to take another look. Henning

On 22.12.2013 17:41, Anoop John wrote:

HIndex is a local indexing mechanism; it is per-region indexing. Phoenix does not yet have a local indexing mechanism (global indexing is in place); the Phoenix team does have it on the roadmap. Have you done something like Lily, Henning? -Anoop-

On Sun, Dec 22, 2013 at 9:39 PM, Ted Yu yuzhih...@gmail.com wrote:

The library from Huawei is the basis for HBASE-10222. Rajeshbabu works for Huawei. Cheers

On Sun, Dec 22, 2013 at 8:00 AM, Pradeep Gollakota pradeep...@gmail.com wrote:

I lied in my previous email... it doesn't look like Phoenix uses HIndex.

On Sun, Dec 22, 2013 at 3:53 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

Take a look at this library from Huawei. They went a step further to colocate the index with the primary partition. I believe Phoenix uses it for its indexing. https://github.com/Huawei-Hadoop/hindex

On Sun, Dec 22, 2013 at 1:34 PM, Ted Yu yuzhih...@gmail.com wrote:

Rajeshbabu is working on HBASE-10222, which adds support for secondary indexes to HBase core. FYI

On Dec 22, 2013, at 2:11 AM, Henning Blohm henning.bl...@zfabrik.de wrote:

Lately we have added a secondary index feature to a persistence tier over HBase. Essentially we implemented what is described as "Dual-Write Secondary Index" in http://hbase.apache.org/book/secondary.indexes.html. I.e., while updating an entity, actually before writing the actual update, the indexes are updated. Lookup via the index ignores stale entries. A recurring rebuild and clean-out of stale entries takes care that the indexes stay trimmed and accurate. None of this was terribly complex to implement. In fact, it seemed like something you could do generically, maybe not on the HBase level itself, but as a toolbox / utility-style library.
Is anybody on the list aware of anything useful already existing in that space? Thanks, Henning Blohm

--
Henning Blohm
*ZFabrik Software KG*
T: +49 6227 3984255
F: +49 6227 3984254
M: +49 1781891820
Lammstrasse 2, 69190 Walldorf
henning.bl...@zfabrik.de
Linkedin http://www.linkedin.com/pub/henning-blohm/0/7b5/628 ZFabrik http://www.zfabrik.de Blog http://www.z2-environment.net/blog Z2-Environment http://www.z2-environment.eu Z2 Wiki http://redmine.z2-environment.net
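The dual-write scheme from the original post (write the index entry before the entity, let reads ignore stale entries, and trim them with a recurring rebuild) can be sketched as follows. This is a toy in-memory model under those assumptions, not Henning's actual library.

```python
# Toy sketch of a dual-write secondary index: index first, entity last.
# A crash between the two writes leaves only a stale index entry, which
# reads filter out and a periodic trim removes.

entities = {}     # entity key -> indexed field value
index = {}        # field value -> set of entity keys

def update(key, value):
    index.setdefault(value, set()).add(key)   # 1) write index entry first
    entities[key] = value                     # 2) write entity last

def lookup(value):
    # Ignore stale entries: only keys whose entity still matches count.
    return {k for k in index.get(value, set()) if entities.get(k) == value}

def trim():
    # The recurring rebuild/clean-out: drop entries the lookup would skip.
    for value, keys in index.items():
        keys.intersection_update(lookup(value))

update("e1", "red")
update("e1", "blue")          # leaves a stale index entry under "red"
assert lookup("red") == set() # stale entry is ignored on read
assert lookup("blue") == {"e1"}
trim()
assert index["red"] == set()  # trim removed the stale entry
```

Writing the entity last is the key ordering choice: a half-finished update can produce a false index entry but never a missing one, so reads stay correct at the cost of an occasional wasted point-get.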
Re: secondary index feature
Lars, that is exactly why I am hesitant to use one of the core-level generic approaches (apart from having difficulties identifying the still active projects): I have doubts I can sufficiently explain to myself when and where they fail.

By "toolbox approach" I meant that turning entity data into index data is not done generically but rather involves domain-specific application code that
- indicates what makes an index key, given an entity
- indicates whether an index entry is still valid, given an entity

That code is also used during index rebuild and trimming (an M/R job). So validating whether an index entry is valid means loading the entity it points to and, before considering it a valid result, checking whether the values of the entity still match the index. The entity is written last; hence, when the client dies halfway through the update, you may get stale index entries, but nothing else should break.

For scanning along the index, we are using a chunk iterator: we read n index entries ahead and then do point lookups for the entities. How would you avoid point-gets when scanning via an index (as, most likely, entities are ordered independently of the index; hence the index)?

Something really important to note is that there is no intention to build a completely generic solution; in particular (this time, unlike the other post of mine you responded to), row versioning is not taken into account. Instead, row timestamps are used to delete stale entries (old entries after an index rebuild).

Thanks a lot for your blog pointers. I haven't had time to study them in depth, but at first glance there is a lot of overlap between what you are proposing and what I ended up doing, at least considering the first post. On the second post: indeed, I have not worried too much about transactional isolation of updates. If the index update and the entity update use the same HBase timestamp, the result should at least be consistent, right?

Btw., in no way am I claiming originality of my thoughts; in particular, I read http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html a while back.

Thanks, Henning

PS: I might write about this discussion later in my blog.

On 22.12.2013 23:37, lars hofhansl wrote: The devil is often in the details. On the surface it looks simple. How specifically are the stale indexes ignored? Are there guaranteed to be no races? Is deletion handled correctly? Does it work with multiple versions? What happens when the client dies halfway through an update? It's easy to do eventually consistent indexes. Truly consistent indexes without transactions are tricky. Also, scanning an index and then doing point-gets against a main table is slow (unless the index is very selective; the Phoenix team measured that there is only an advantage if the index filters out 98-99% of the data). So then one would revert to covered indexes, and suddenly it is not so easy to detect stale index entries. I blogged about these issues here: http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html Phoenix has a (pretty involved) solution now that works around the fact that HBase has no transactions. -- Lars

From: Henning Blohm henning.bl...@zfabrik.de To: user user@hbase.apache.org Sent: Sunday, December 22, 2013 2:11 AM Subject: secondary index feature Lately we have added a secondary index feature to a persistence tier over HBase. Essentially we implemented what is described as "Dual-Write Secondary Index" in http://hbase.apache.org/book/secondary.indexes.html, i.e. while updating an entity, actually before writing the actual update, the indexes are updated. Lookup via the index ignores stale entries. A recurring rebuild and clean-out of stale entries takes care that the indexes stay trimmed and accurate. None of this was terribly complex to implement.
In fact, it seemed like something you could do generically, maybe not on the HBase level itself, but as a toolbox / utility style library. Is anybody on the list aware of anything useful already existing in that space? Thanks, Henning Blohm *ZFabrik Software KG* T: +49 6227 3984255 F: +49 6227 3984254 M: +49 1781891820 Lammstrasse 2 69190 Walldorf henning.bl...@zfabrik.de Linkedin http://www.linkedin.com/pub/henning-blohm/0/7b5/628 ZFabrik http://www.zfabrik.de Blog http://www.z2-environment.net/blog Z2-Environment http://www.z2-environment.eu Z2 Wiki http://redmine.z2-environment.net
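As a toolbox-style illustration, the dual-write scheme Henning describes (index written first, entity written last, reads validating index entries against the entity and skipping stale ones) can be sketched with plain dictionaries standing in for HBase tables. This is a hedged sketch of the pattern only, not Henning's actual code; every name in it is invented.

```python
# Sketch of a dual-write secondary index with validate-on-read.
# Two dicts stand in for HBase tables; all names are illustrative.

entity_table = {}   # entity row key -> entity dict
index_table = {}    # index key -> entity row key

def index_key(entity):
    # Domain-specific code: what makes an index key, given an entity.
    return ("by_email", entity["email"])

def index_entry_valid(entity, idx_key):
    # Domain-specific code: is this index entry still valid for the entity?
    return index_key(entity) == idx_key

def save(row_key, entity):
    # Dual write: index first, entity last. If the client dies in
    # between, the worst case is a stale index entry, nothing else.
    index_table[index_key(entity)] = row_key
    entity_table[row_key] = entity

def lookup_by_email(email):
    # Read via the index, then point-get the entity and re-validate;
    # stale entries are simply skipped (a rebuild job would trim them).
    idx_key = ("by_email", email)
    row_key = index_table.get(idx_key)
    if row_key is None:
        return None
    entity = entity_table.get(row_key)
    if entity is not None and index_entry_valid(entity, idx_key):
        return entity
    return None

save("row1", {"email": "a@example.com", "name": "Alice"})
save("row1", {"email": "b@example.com", "name": "Alice"})  # email changed

print(lookup_by_email("b@example.com"))  # current entry resolves
print(lookup_by_email("a@example.com"))  # stale entry is ignored
```

Note that after the second save the old index entry is still physically present; it is only filtered out at read time, which is exactly the property the recurring rebuild/trim job relies on.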
Re: secondary index feature
Henning, Jesse Yates wrote the back-end of our global secondary indexing system in Phoenix. He designed it as a separate, pluggable module with no Phoenix dependencies. Here's an overview of the feature: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section that discusses the data guarantees and failure management might be of interest to you: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management This presentation also gives a good overview of the pluggability of his implementation: http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx Thanks, James

On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm henning.bl...@zfabrik.de wrote: [quoted message trimmed]
Re: secondary index feature
James, that is super interesting material! Thanks, Henning

On 23.12.2013 19:01, James Taylor wrote: [quoted message trimmed]
Re: secondary index feature
The work that James is referencing grew out of the discussions Lars and I had (which led to those blog posts). The solution we implemented is designed to be generic, as James mentioned above, but was written with all the hooks necessary for Phoenix to do some really fast updates (or to skip updates in the case where there is no change). You should be able to plug your own simple index builder (there is an example in the Phoenix codebase: https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example) into the basic solution, which supports the same transactional guarantees as HBase (per row) plus data guarantees across the index rows. There are more details in the presentations James linked. I'd love to see if your implementation can fit into the framework we wrote; we would be happy to work with you to see if it needs some more hooks or modifications. I have a feeling this is pretty much what you guys will need. -Jesse

On Mon, Dec 23, 2013 at 10:01 AM, James Taylor jtay...@salesforce.com wrote: [quoted message trimmed]
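The pluggable design Jesse describes essentially reduces to one hook: given a mutation on the data table (and the prior state of the row), produce the mutations to apply to the index table. A minimal sketch of that contract follows; all class and method names here are invented for illustration, not the real Phoenix Java interfaces.

```python
# Sketch of a pluggable index-builder contract, loosely modeled on the
# idea behind Phoenix's covered-index module. All names are invented.

class IndexBuilder:
    def index_updates(self, row_key, old_row, new_row):
        """Return (index_table, index_row_key, op) tuples for one
        data-table mutation. 'op' is 'put' or 'delete'."""
        raise NotImplementedError

class CoveredColumnIndexBuilder(IndexBuilder):
    def __init__(self, index_table, indexed_column):
        self.index_table = index_table
        self.col = indexed_column

    def index_updates(self, row_key, old_row, new_row):
        updates = []
        old_val = old_row.get(self.col) if old_row else None
        new_val = new_row.get(self.col) if new_row else None
        if old_val == new_val:
            return updates  # no change: skip index maintenance entirely
        if old_val is not None:
            # prior row state known -> clean up the old index row
            updates.append((self.index_table, (old_val, row_key), "delete"))
        if new_val is not None:
            updates.append((self.index_table, (new_val, row_key), "put"))
        return updates

builder = CoveredColumnIndexBuilder("idx_email", "email")
ups = builder.index_updates("row1", {"email": "a@x"}, {"email": "b@x"})
print(ups)  # delete of the old index row, put of the new one
```

The "skip updates where there is no change" fast path Jesse mentions corresponds to the early return when the indexed value is unchanged.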
Re: secondary index feature
I lied in my previous email... it doesn't look like Phoenix uses HIndex. On Sun, Dec 22, 2013 at 3:53 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Take a look at this library from Huawei. They went a step further and colocate the index with the primary partition. I believe Phoenix uses it for its indexing. https://github.com/Huawei-Hadoop/hindex On Sun, Dec 22, 2013 at 1:34 PM, Ted Yu yuzhih...@gmail.com wrote: Rajeshbabu is working on HBASE-10222, which adds support for secondary indexes to HBase core. FYI On Dec 22, 2013, at 2:11 AM, Henning Blohm henning.bl...@zfabrik.de wrote: [quoted original post trimmed]
Re: secondary index feature
The library from Huawei is the basis for HBASE-10222. Rajeshbabu works for Huawei. Cheers

On Sun, Dec 22, 2013 at 8:00 AM, Pradeep Gollakota pradeep...@gmail.com wrote: [quoted messages trimmed]
Re: secondary index feature
HIndex is a local indexing mechanism; it is per-region indexing. Phoenix does not yet have a local indexing mechanism (global indexing is in place), but the Phoenix team has that on their roadmap. Have you done something like Lily, Henning? -Anoop-

On Sun, Dec 22, 2013 at 9:39 PM, Ted Yu yuzhih...@gmail.com wrote: [quoted messages trimmed]
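The local-vs-global distinction Anoop draws has a direct read-path consequence: with a per-region (local) index there is no way to know up front which region holds a matching entry, so an index lookup must fan out to every region, while a global index is just one more table with its own row key and answers in a single lookup. A toy contrast (all structures invented for illustration):

```python
# Toy contrast of local (per-region) vs global index lookups.
# Regions are modeled as dicts; all names are illustrative.

regions = [
    {"local_index": {"a@x": "row1"}},
    {"local_index": {}},
    {"local_index": {"b@x": "row7"}},
]
global_index = {"a@x": "row1", "b@x": "row7"}  # one extra table

def lookup_local(key):
    # Per-region index: the matching region is unknown, so every
    # region must be consulted (in parallel on a real cluster).
    hits, rpcs = [], 0
    for region in regions:
        rpcs += 1
        if key in region["local_index"]:
            hits.append(region["local_index"][key])
    return hits, rpcs

def lookup_global(key):
    # Global index: a single lookup against the index table.
    row = global_index.get(key)
    return ([row] if row else []), 1

print(lookup_local("b@x"))   # found, but every region was asked
print(lookup_global("b@x"))  # found with one request
```

This is the scalability trade-off raised earlier in the thread: the local-index fan-out grows with the number of regions and concurrent clients, whereas the global index pays its cost at write time instead.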
Re: secondary index feature
The devil is often in the details. On the surface it looks simple. How specifically are the stale indexes ignored? Are there guaranteed to be no races? Is deletion handled correctly? Does it work with multiple versions? What happens when the client dies halfway through an update? It's easy to do eventually consistent indexes. Truly consistent indexes without transactions are tricky. Also, scanning an index and then doing point-gets against a main table is slow (unless the index is very selective; the Phoenix team measured that there is only an advantage if the index filters out 98-99% of the data). So then one would revert to covered indexes, and suddenly it is not so easy to detect stale index entries. I blogged about these issues here: http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html Phoenix has a (pretty involved) solution now that works around the fact that HBase has no transactions. -- Lars

From: Henning Blohm henning.bl...@zfabrik.de To: user user@hbase.apache.org Sent: Sunday, December 22, 2013 2:11 AM Subject: secondary index feature [quoted original post trimmed]
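Lars's 98-99% figure can be made plausible with a back-of-envelope cost model: an index read costs a scan over the matching index entries plus one point-get per match, and a random point-get is far more expensive per row than a sequential scan. The cost constants below are invented purely for illustration (only the shape of the comparison matters, not the absolute numbers):

```python
# Back-of-envelope model: index-then-point-get vs full table scan.
# Cost constants are assumptions for illustration, not measurements.

SCAN_COST_PER_ROW = 1.0   # sequential read of one row
POINT_GET_COST = 50.0     # random read back to the data table

def full_scan_cost(total_rows):
    return total_rows * SCAN_COST_PER_ROW

def index_scan_cost(total_rows, selectivity):
    # selectivity = fraction of rows the index filters OUT
    matches = total_rows * (1.0 - selectivity)
    return matches * (SCAN_COST_PER_ROW + POINT_GET_COST)

total = 1_000_000
for sel in (0.90, 0.98, 0.99):
    idx = index_scan_cost(total, sel)
    print(f"selectivity {sel:.0%}: index wins: {idx < full_scan_cost(total)}")
```

With these assumed constants the break-even lands just above 98% selectivity, consistent with the range Lars quotes; a covered index sidesteps the point-gets entirely, which is why it changes the trade-off (at the price of harder stale-entry detection).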