This is awesome, thanks for taking charge on this, Christoph. HBaseStorage has grown to the point that it could use some love. A few comments:
- Yes, HBaseStorage has gotten too big and we need to think about how to break it up a bit. Maybe we make it into a composite or we break the storage/loader parts out somehow, but the changes will need to be backwards compatible. I think this work should be done in its own JIRA without any new functionality.

- There has been discussion in the past about supporting returning multiple versions of a cell with timestamps. The thought was that this would produce a different schema and would be in a new storage/loader class. The idea is that you'd get one row per row key (rk), but each descriptor field would have a tuple of two-tuples (ts, value). Would this work for your needs instead of producing multiple rows per rk? Producing multiple rows per rk would require some tricky grouping to get specific fields out, especially if the cell values don't share common timestamps. If we had one tuple per rk, that would lend itself to UDFs that could operate on each field's cell values.

- What you're proposing re the snapshot functionality is great, but I think the syntax is a bit confusing. The term 'snapshot' might mean different things to different people, but if we talk in terms of cell timestamps I think it will make the implied functionality very clear. Also, speaking in terms of greater than or less than helps. This would also align with the current syntax. I'm thinking of options like this:

  -cellTsLt
  -cellTsGt
  -cellTsEquals
  -cellTsLte
  -cellTsGte
  -cellLimit

Would that work for your use case? That would allow us to support returning multiple cell versions or just one with that syntax. cellLimit would default to 1, but you could set it > 1 to get back multiple versions of a cell.

Let me know what you think.

thanks,
Bill

On Thu, Nov 8, 2012 at 7:18 AM, Christoph Bauer <[email protected]> wrote:

> Hi,
>
> here at postdirekt we have need for a lot more timestamp handling in
> HBaseStorage than there is. We're starting on a patch to pig.
>
> I think there are many people out there who would welcome those changes and
> we are willing to pass that patch on to the community if it is desired.
>
> So there is a short proposal here:
>
> https://cwiki.apache.org/confluence/display/PIG/HBaseStorage+Timestamp+Extensions
>
> We're also open to other changes. So please reply.
>
>
> I have a question:
> HBaseStorage is getting really big and could do with splitting up into
> smaller parts to make it readable again. Would this require a patch on its
> own?
>
> regards,
> Christoph Bauer
>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at [email protected] going forward.*
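
P.S. One possible reading of the cellTs*/cellLimit options and the "one tuple per rk" schema discussed above, written as a rough Python model. This is only a sketch, not Pig or HBaseStorage code: select_versions and load_row are made-up names, the keyword arguments merely mirror the proposed option names, and it assumes that the newest matching versions are the ones kept when cellLimit truncates the list.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, str]  # (timestamp, value)


def select_versions(
    cells: List[Cell],
    ts_lt: Optional[int] = None,      # models -cellTsLt
    ts_gt: Optional[int] = None,      # models -cellTsGt
    ts_equals: Optional[int] = None,  # models -cellTsEquals
    ts_lte: Optional[int] = None,     # models -cellTsLte
    ts_gte: Optional[int] = None,     # models -cellTsGte
    limit: int = 1,                   # models -cellLimit (default 1)
) -> List[Cell]:
    """Return up to `limit` versions whose timestamps pass every given bound."""

    def keep(ts: int) -> bool:
        if ts_lt is not None and not ts < ts_lt:
            return False
        if ts_gt is not None and not ts > ts_gt:
            return False
        if ts_equals is not None and ts != ts_equals:
            return False
        if ts_lte is not None and not ts <= ts_lte:
            return False
        if ts_gte is not None and not ts >= ts_gte:
            return False
        return True

    # Assumption: newest versions first, then truncate to the limit.
    matching = sorted(
        (c for c in cells if keep(c[0])), key=lambda c: c[0], reverse=True
    )
    return matching[:limit]


def load_row(row_key: str, columns: Dict[str, List[Cell]], **opts) -> tuple:
    """Build one output tuple per row key: (rk, ((ts, value), ...), ...)."""
    fields = tuple(
        tuple(select_versions(cells, **opts)) for cells in columns.values()
    )
    return (row_key,) + fields


# Hypothetical row with two columns whose versions do not share timestamps.
row = {
    "cf:a": [(100, "a1"), (200, "a2"), (300, "a3")],
    "cf:b": [(150, "b1"), (250, "b2")],
}

print(load_row("rk1", row))
# ('rk1', ((300, 'a3'),), ((250, 'b2'),))
print(load_row("rk1", row, ts_lt=300, limit=2))
# ('rk1', ((200, 'a2'), (100, 'a1')), ((250, 'b2'), (150, 'b1')))
```

With the default limit of 1 each field collapses to a single (ts, value) pair, which is close to what the loader returns today plus the timestamp. With a higher limit, a UDF could walk each field's versions independently, without any grouping across rows.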
