Hi Nick, Happy New Year to you too! I'd be happy to give you an overview of secondary indexes. When you declare a table with IMMUTABLE_ROWS=true, you're telling Phoenix that you have write-once/append-only data (i.e. a row will never be updated in-place). This makes the incremental maintenance of a secondary index much easier because you simply need to generate the correct index row based on your data row (note that with Phoenix, index table rows are 1:1 with data table rows, and index tables are just regular old HBase tables with a different row key structure). This logic is encapsulated in IndexMaintainer.java. With mutable tables you need to do this too, but you also need to delete the index row for the current value, which complicates things.
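To make the cheapness of write-once maintenance concrete, here's a minimal Python sketch (plain dicts standing in for HBase tables; make_index_row and upsert_immutable are invented names for illustration, not Phoenix code). The point is that the index Put is derived entirely from the new data row, with no read or delete of a prior index row:

```python
# Illustrative sketch, not Phoenix code: with write-once rows, the index
# update for an upsert is derived purely from the new data row.

def make_index_row(data_row_key, data_row, indexed_cols):
    """Build the index row key: indexed column values first, then the
    data row key, keeping index rows 1:1 with data rows."""
    parts = [str(data_row.get(c, "")) for c in indexed_cols]
    parts.append(data_row_key)
    return "\x00".join(parts)

def upsert_immutable(data_table, index_table, row_key, row, indexed_cols):
    # One Put to the data table, plus one derived Put to the index table.
    # No lookup of the old row is needed -- it can never change.
    data_table[row_key] = row
    index_table[make_index_row(row_key, row, indexed_cols)] = row_key

data, idx = {}, {}
upsert_immutable(data, idx, "r1", {"v": "apple"}, ["v"])
upsert_immutable(data, idx, "r2", {"v": "banana"}, ["v"])
```

For a mutable table, upsert_immutable would additionally have to find and delete the index row built from the row's previous value, which is the extra work the paragraph above alludes to.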
Because indexes over immutable data don't require additional processing during incremental maintenance, we can maintain the index on the client side by simply generating the index row when the data row is upserted (see MutationState.java and the addRowMutations method). The batch of Puts for the index updates gets sent over after the batch of Puts for the data table (see the loop in the send method). There's an optimization for local indexes, however. Since we know that the index data resides on the same region server as the data table, we can generate the index row through a coprocessor on the server side instead of sending it over the wire.

Deletes for write-once data are handled differently, by generating the equivalent SQL statement against the index table that was run against the data table. The reason is that we don't necessarily have the information required on the client side when we issue a delete of a data row. For example, if a column represented by a KeyValue in HBase is indexed, when the row is deleted we don't know what the value of that KeyValue was.

In terms of data formats, the index row key is formed from the columns you're indexing. We also tack on any pk column values from the data table that aren't indexed (that's how we ensure the 1:1 relationship between data rows and index rows). We sometimes have to translate the data type of the indexed data table column (see IndexUtil.getIndexColumnDataType()). The reason is that for fixed-width types (such as INTEGER and BIGINT), we don't have a representation for null. We need to be able to represent null because in the index the column value appears in the row key, and we need null values to sort before any other value. The way we do this is by translating from INTEGER/BIGINT to DECIMAL for the index table. You can see this if you examine the schema of an index table and compare it against the schema of its data table (you can do this easily in SQuirreL).
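To see why fixed-width types are a problem for nulls in a row key, here's a small Python sketch (the encodings are illustrative stand-ins, not Phoenix's actual serialization): every 4-byte string is already claimed by some 32-bit integer, so there's nothing left over to mean null, whereas a variable-length encoding can use the empty byte string, which sorts before everything else:

```python
import struct

def encode_int(v):
    # Fixed-width 32-bit encoding, offset by 2**31 so that unsigned
    # byte order matches signed numeric order. Every 4-byte value is
    # taken by some integer -- there is no representation for null.
    return struct.pack(">I", v + 0x80000000)

def encode_nullable(v):
    # Variable-length stand-in (Phoenix uses DECIMAL for this purpose):
    # null becomes the empty byte string, which sorts before any value.
    return b"" if v is None else encode_int(v)

values = [7, None, -3, 0]
print(sorted(values, key=encode_nullable))  # null first, then numeric order
```

Sorting by the encoded bytes puts None ahead of -3, 0, and 7, which is exactly the "null sorts before any other value" property the index row key needs.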
Failure scenarios are interesting to think about. For example, what happens if a RS hosting an index region goes down right after you've updated the data region? For immutable indexes, you're basically SOL (though the story improves with transactions - more on that later): at that point, your index and data table are out of sync. Since your data is immutable, though, you can safely replay your commit until it succeeds. If you wanted to, you could even force the upper bound of your queries to be "frozen" at the timestamp at which you know the index is valid. Another option would be to mark the index as disabled (so it's no longer used) and rebuild it in the background. These are all techniques an application can use to deal with failures, but Phoenix doesn't take any action (other than throwing whatever exception occurred at write time so the client knows about it).

For mutable indexes, we do a bit more, but it's far from perfect as well. By default, if a write failure occurs, we mark the index as disabled and have a background task that tries to catch the index up once the region can be reached again (followed by enabling the index again). We also write the index updates with the data updates in the WAL, so if the RS goes down, we'll write the data updates and index updates when it comes back up. There's also some work underway (PHOENIX-2221) to fail writes to data tables if the index table cannot be written to.

Another consideration, depending on your use case, is write skew. It's possible that a batch of mutations may hit the data table before they hit the index table (based on how the updates occur as described above). Until both batches are processed, the index and data table are not *exactly* in sync. With support for transactions, secondary index updates can be transactional wrt data updates, in which case there's no scenario where your data table and index table get out of sync (including the write skew case).
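Of the application-side techniques above, the simplest for immutable data is replaying the commit until it succeeds, since write-once rows make the replay idempotent. A hypothetical sketch of that pattern (commit_with_replay is an invented helper, not a Phoenix API):

```python
import time

# Illustrative client-side pattern, not Phoenix code: because immutable
# writes are idempotent, a failed commit can simply be replayed until
# both the data and index batches go through.

def commit_with_replay(commit, attempts=5, backoff_s=0.1):
    """commit() is assumed to raise IOError on failure and to be safe
    to re-run (true for write-once/append-only data)."""
    for attempt in range(attempts):
        try:
            commit()
            return True
        except IOError:
            # Back off exponentially, e.g. while the RS region recovers.
            time.sleep(backoff_s * (2 ** attempt))
    return False  # give up; caller can disable/rebuild the index instead
```

If the replay budget is exhausted, the caller falls back to the other options above: freeze query timestamps at the last known-valid point, or disable and rebuild the index.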
For write-once/append-only data, this works well because there's not a lot of overhead for transactions (since by definition, no write conflicts are possible, so no conflict detection is necessary). HTH. Please let me know if you'd like more info as you get into the code more.

Thanks,
James

On Mon, Jan 4, 2016 at 9:43 AM, Nick Dimiduk <[email protected]> wrote:
> Happy New Year everyone!
>
> I'd like to come up to speed on our secondary index implementations. I've
> combed through the doc page [0] and experimented with tweaking index
> definitions and examining changes to the explain plan output. Now I'd like
> to get deeper into index formats, maintenance strategies, &c. I'm mostly
> interested in the IMMUTABLE_ROWS=true details. Do any of you have a doc,
> deck, or video I can watch to go the next step?
>
> Thanks a lot,
> -n
>
> [0]: http://phoenix.apache.org/secondary_indexing.html
