Hi Lars,

On Mon, Nov 9, 2009 at 11:59 PM, Lars Francke <[email protected]> wrote:
> I've read numerous threads on this mailing list and I've asked several
> times on IRC but the answers I get are rarely the same so I'd like to
> try once more.

I don't think this is a right-or-wrong kind of question. HBase gives you
several options to do this, and that's why people give you different
suggestions.

> I have a data model that would be a perfect match for the
> versions/timestamps that are available in HBase. Some say that it is
> perfectly feasible to use the versions as another "data dimension" and
> some say that it isn't meant to be used that way at all. The BigTable
> paper doesn't go into very much detail about this but from what I
> gathered it is indeed used as an additional dimension.

I think the same way as you do; it can be used as an additional
dimension.

[ The model that uses the versions as a data dimension ]

> In my data model the versions would start at 1 and be ascending - no
> timestamps but HBase doesn't enforce those.

I'm not sure I understand you correctly. A row doesn't have a version
value in its key, only a user-specified id and a timestamp. Please see
the KeyValue section of this wonderful blog post by Lars George.

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

> The upside of this model
> would be that only the difference between two versions would have to
> be saved and that I'd be provided with a nice API to handle versions.

Agreed.

[ The model that does not use the versions but compound row key ]

> The model proposed to me numerous times using a compound row key
> (model id:version) would save duplicates of the data (or I'd have to
> handle the diffs myself).

That's right.

> Another upside would be that it would
> require only a Get to get an element and its history.

I don't think this is accurate. To get the whole history at once with
the compound row key, you would not use a Get but a Scan with a prefix
key (model id). Also, with the earlier model, you can still get the
history with a single Get (Get has #setMaxVersions(int)). So both models
can do this.
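
To make the two access patterns a bit more concrete, here is a rough
sketch in plain Java. It does not use the HBase client at all; it only
models the two layouts with sorted maps. The class and method names
(HistoryModels, putVersioned, putCompound and so on) are made up for
this illustration, and the zero-padded "id:version" key format is an
assumption, not something HBase requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the two schemas; not the HBase API.
class HistoryModels {
    // Model 1: one row per model id, versions kept in a per-row map
    // sorted newest first (like the timestamp dimension of a cell).
    final NavigableMap<String, NavigableMap<Long, String>> versioned =
            new TreeMap<String, NavigableMap<Long, String>>();

    // Model 2: one row per compound key "id:version", sorted by key.
    final NavigableMap<String, String> compound = new TreeMap<String, String>();

    void putVersioned(String id, long version, String value) {
        NavigableMap<Long, String> cells = versioned.get(id);
        if (cells == null) {
            cells = new TreeMap<Long, String>(Collections.<Long>reverseOrder());
            versioned.put(id, cells);
        }
        cells.put(version, value);
    }

    // Assumed key format: zero-padded version so that the
    // lexicographic key order matches the numeric version order.
    static String compoundKey(String id, long version) {
        return String.format("%s:%010d", id, version);
    }

    void putCompound(String id, long version, String value) {
        compound.put(compoundKey(id, version), value);
    }

    // Comparable to one Get with max versions: a single row lookup
    // returns the whole history, newest version first.
    List<String> historyVersioned(String id) {
        NavigableMap<Long, String> cells = versioned.get(id);
        return cells == null ? new ArrayList<String>()
                             : new ArrayList<String>(cells.values());
    }

    // Comparable to a Scan over the key prefix "id:". The stop key
    // uses ';', the character that follows ':' in ASCII.
    List<String> historyCompound(String id) {
        return new ArrayList<String>(
                compound.subMap(id + ":", true, id + ";", false).values());
    }

    // With the compound key, one specific version is a direct lookup.
    String getCompound(String id, long version) {
        return compound.get(compoundKey(id, version));
    }

    public static void main(String[] args) {
        HistoryModels m = new HistoryModels();
        for (long v = 1; v <= 3; v++) {
            m.putVersioned("model1", v, "value" + v);
            m.putCompound("model1", v, "value" + v);
        }
        System.out.println(m.historyVersioned("model1"));
        System.out.println(m.historyCompound("model1"));
        System.out.println(m.getCompound("model1", 2));
    }
}
```

In this sketch both layouts return the full history in one call; they
differ only in the sort order of the result and in how directly a
single version can be addressed.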

I think the upside of the latter model (compound row key) is that you
can get a specific version very quickly, because the version value is
part of the key. The earlier model requires you to iterate through the
whole history and look at the timestamps to find the right version.

> I require "out of order" insertion to the versions and I was told that
> this is probably no problem as long as I don't delete a version. Is
> this true?

I don't have the answer. You might want to try it yourself.

> I know that there is a limit for versions (Integer.MAX_VALUE as far as
> I can see) and for some of my tables this will be a problem so I'd end
> up using a mix of both these models anyway but if possible I'd like to
> use the version model provided by HBase where I can. I haven't seen a
> single example schema, tutorial, ... that talks about the versions in
> schemas; they seem to go mainly unused.

I couldn't find examples that retrieve a column value with a specific
timestamp, and the 0.20.x API doesn't seem to have convenience methods
for this. You'll have to call Result#sorted() to get sorted KeyValues,
or Result#getMap() to get NavigableMaps. Then you'll iterate through one
of them to find a specific column with a specific timestamp.

> So my question would be: Should I use versions as an important part of
> my schema or not? If not are there any tips/hints on management of
> versions using compound keys and what the versions/timestamps are used
> for if not as an additional data dimension?

It depends on how often you will search for a specific version of a
record. If you do this very often, I think the latter model (compound
row key) will be easier to work with. Otherwise, the earlier model
(using versions) can be an option.

> And one more question about a "proper" schema: I have quite a lot of
> places that merely save a list of things it relates to without
> requiring any additional information (Many-to-Many).
> I'd have introduced a new column family and used the columns as keys
> to another table but I won't need the column value. How does HBase
> behave in regard to "null" as a column value? The FAQ entry about this
> topic is a bit unclear. Or is this the wrong way to begin with?

I believe you can't literally give a "null" to a column, so use an empty
(zero-length) byte array instead. Since it's a zero-length array, the
value itself doesn't take up any disk space.

Hope this helps,

-- 
Tatsuya Kawano (Mr.)
Tokyo, Japan
