On Thu, Oct 17, 2013 at 3:38 PM, Josh Wills <[email protected]> wrote:
> My feeling is that the consensus here is that adding a new PTypeFamily is a
> bad idea. :)
>
> The other idea I had would be to add a way for the Source and Target to
> indicate that they were reading input data directly from the Hadoop
> serialization framework, and thus did not need the input/output PTypes to
> perform any additional transforms via getInputMapFn/getOutputMapFn. We
> would still need different PTypes for working with the HBase objects (along
> the lines that Gabriel mentioned earlier in the thread), but this approach
> would solve the core issue w/o requiring a new PTypeFamily.
>

Is there any chance that this is just as simple as (re)implementing the
getConverter method in the HBase-related Source and Target impls?

>
> On Wed, Oct 16, 2013 at 7:17 PM, Micah Whitacre <[email protected]> wrote:
>
> > If we created a new PTypeFamily we'd need to build in support to the Avros
> > (and possibly Writables) class to support wrapping the HBaseTypeFamily
> > types.
> >
> > On Wed, Oct 16, 2013 at 10:20 AM, Gabriel Reid <[email protected]> wrote:
> >
> > > Ok, makes sense. And yeah, going from a Put to bytes and then back to a
> > > Put in order to write to HBase doesn't sound too awesome.
> > >
> > > On Wed, Oct 16, 2013 at 5:10 PM, Josh Wills <[email protected]> wrote:
> > >
> > > > On Wed, Oct 16, 2013 at 8:02 AM, Gabriel Reid <[email protected]> wrote:
> > > >
> > > > > On Wed, Oct 16, 2013 at 4:34 PM, Josh Wills <[email protected]> wrote:
> > > > >
> > > > > > On Wed, Oct 16, 2013 at 12:15 AM, Gabriel Reid <[email protected]> wrote:
> > > > > >
> > > > > > > Wouldn't a derived PType (like in o.a.c.types.PTypes) be a better
> > > > > > > fit here?
> > > > > >
> > > > > > That was my initial attempt, and in an ideal world, my preferred
> > > > > > solution-- but I haven't figured out how to make it work. The
> > > > > > question here is: what do I derive a KeyValue object to? What I
> > > > > > really want, for purposes of reading it/writing it to one of our
> > > > > > HBase IO formats, is to map it to itself, and not some subclass of
> > > > > > Writable. Another option might be an extension of WritableType to
> > > > > > handle these special case formats-- I'll take a crack at getting
> > > > > > that to work.
> > > > >
> > > > > I'm sure I'm just missing something obvious, but I don't totally get
> > > > > it. What I had in my head is that KeyValue, Put, Delete, Result, etc
> > > > > could all be derived to byte arrays, with the KeyValueSerialization,
> > > > > MutationSerialization, and ResultSerialization classes being used in
> > > > > the MapFns within the derived PType to go between the type and its
> > > > > byte representation, i.e.
> > > > >
> > > > > public static PType<KeyValue> keyValue(PTypeFamily ptf) {
> > > > >   return ptf.derived(
> > > > >       KeyValue.class,
> > > > >       BYTES_TO_KEYVALUE_VIA_KVSERIALIZATION,
> > > > >       KEYVALUE_TO_BYTES_VIA_KVSERIALIZATION,
> > > > >       ptf.bytes());
> > > > > }
> > > > >
> > > > > I'm guessing this is the same thing you're talking about, which I
> > > > > assume means that I'm missing something simple as to why that
> > > > > wouldn't just work, but I'm not sure what it is that I'm missing.
> > > >
> > > > The rub is the Input and Output formats, which don't expect bytes--
> > > > they expect either subclasses of the Mutation interface (Put or
> > > > Delete), or KeyValue (for HFile) or Result (for HTable) inputs. So we
> > > > would need to change the input and output formats so that they would
> > > > take in bytes as arguments and then convert them back to the objects
> > > > that the HBase APIs expect, so something like:
> > > >
> > > >   getOutputMapFn() -> OutputFormat
> > > >   Put -> bytes() -> Put
> > > >
> > > > That isn't the end of the world, it's just a little odd. We'd need to
> > > > do something similar on the Input format side as well, so like:
> > > >
> > > >   InputFormat -> getInputMapFn()
> > > >   Result -> bytes() -> Result
> > > >
> > > > > > > A whole new PTypeFamily sounds like a lot of work (unless maybe
> > > > > > > if it was a subclass of one of the existing ones), and I think
> > > > > > > there's still a fair bit of code that assumes that Avro &
> > > > > > > Writable are the only two possible PTypeFamily implementations.
> > > > > >
> > > > > > For any kind of intermediate processing, that is still true. The
> > > > > > HBaseTypeFamily would only ever really appear at the input or
> > > > > > output for a job.
> > > > >
> > > > > True, although of course it would be nice if we wouldn't have that
> > > > > limitation.
> > > > >
> > > > > - Gabriel
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
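
For anyone skimming the thread, the "Put -> bytes() -> Put" round trip discussed above can be sketched without Crunch or HBase on the classpath. This is only an illustration under stated assumptions: FakePut, DerivedBytesType and RoundTripSketch are hypothetical names invented here, not Crunch or HBase APIs, and the "serialization" is just a copy of the row key. It shows the shape of the problem: if the output format insists on the original object type, it has to undo the output map function that the derived type just applied.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical stand-in for an HBase type such as Put; NOT the real class.
final class FakePut {
    final byte[] row;
    FakePut(byte[] row) { this.row = row.clone(); }
}

// Hypothetical stand-in for a derived type: a pair of map functions
// between the user-facing type T and a byte[] base representation.
final class DerivedBytesType<T> {
    final Function<byte[], T> inputMapFn;   // bytes -> T, applied after reading
    final Function<T, byte[]> outputMapFn;  // T -> bytes, applied before writing

    DerivedBytesType(Function<byte[], T> inputMapFn, Function<T, byte[]> outputMapFn) {
        this.inputMapFn = inputMapFn;
        this.outputMapFn = outputMapFn;
    }
}

final class RoundTripSketch {
    // Assumed toy serialization: the row key bytes are the whole payload.
    static final DerivedBytesType<FakePut> PUT_TYPE = new DerivedBytesType<>(
            bytes -> new FakePut(bytes),
            put -> put.row.clone());

    // Models an output path whose format only accepts FakePut objects:
    // the bytes produced by the output map function must be converted back.
    static FakePut writeViaBytes(FakePut put) {
        byte[] wire = PUT_TYPE.outputMapFn.apply(put);  // Put -> bytes
        return PUT_TYPE.inputMapFn.apply(wire);         // bytes -> Put
    }

    public static void main(String[] args) {
        FakePut original = new FakePut("row-1".getBytes(StandardCharsets.UTF_8));
        FakePut written = writeViaBytes(original);
        // The payload survives, but only after two conversions that cancel out.
        System.out.println(Arrays.equals(original.row, written.row));
    }
}
```

The two conversions in writeViaBytes cancel each other, which is the redundancy that letting the Source and Target skip getInputMapFn/getOutputMapFn would avoid.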
