Re: Adding features to HBase Input Operators in Malhar-contrib

Sandeep Deshmukh Mon, 28 Mar 2016 05:55:36 -0700

I shall do that in a day or two.

Regards,
Sandeep


On Thu, Mar 24, 2016 at 6:10 PM, Bhupesh Chawda <[email protected]>
wrote:

> Dear Community,
>
> Can anyone help review the pull request:
> https://github.com/apache/incubator-apex-malhar/pull/212
>
> Thanks.
>
> ~Bhupesh
>
> On Thu, Mar 17, 2016 at 4:16 PM, Bhupesh Chawda <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have opened a pull request for the changes as described in the previous
> > emails. Here is the pull request:
> > https://github.com/apache/incubator-apex-malhar/pull/212
> >
> > Here is a short description of the changes:
> >
> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
> > of HBaseOperatorBase.
> > HBaseScanOperator - Takes care of scanning the table in a non-blocking
> > manner. Exposes operationScan() and getTuple() as before.
> > HBasePOJOInputOperator - Implements operationScan() and getTuple() and
> > outputs a POJO on the output port.
> >
> > Please help review these changes.
> >
> > Thanks
> > ~Bhupesh
> >
> > On Fri, Mar 11, 2016 at 4:42 PM, Bhupesh Chawda <[email protected]
> >
> > wrote:
> >
> >> Hi All,
> >>
> >> In the current design of HBase input and output operators, the row key
> >> is hard-coded to be of String type.
> >> I foresee the following issue:
> >>
> >>    - In case of numeric keys which are cast to String, *incremental
> >>    read* is problematic. For example, after reading key = 9, we may not
> >>    be able to read any record with, say, key = 8888, even though
> >>    numerically 8888 > 9, because lexicographically "9" > "8888".
> >>    - This is the case only when data is being written to HBase and
> >>    read from it simultaneously.
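
To make the ordering issue above concrete, here is a minimal sketch that is not taken from the pull request. It assumes row keys are compared as unsigned bytes in lexicographic order, which is how HBase sorts rows. The class RowKeyOrderingDemo and the helpers encodeLong() and compareBytes() are hypothetical names used only for illustration, and the fixed-width encoding shown preserves numeric order only for non-negative values.

```java
import java.nio.ByteBuffer;

class RowKeyOrderingDemo {
    // Hypothetical helper: encode a non-negative long as 8 big-endian bytes.
    static byte[] encodeLong(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    // Unsigned, byte-by-byte lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // As strings, "9" sorts after "8888", so a scan resuming after "9" never sees 8888.
        System.out.println("string order, \"9\" after \"8888\": " + ("9".compareTo("8888") > 0));
        // As fixed-width big-endian bytes, 9 sorts before 8888, matching numeric order.
        System.out.println("byte order, 9 before 8888: "
            + (compareBytes(encodeLong(9), encodeLong(8888)) < 0));
    }
}
```

With a fixed-width binary encoding, an incremental read that resumes after the last key seen would pick up 8888 after 9, which the String encoding cannot guarantee.
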
> >>
> >> My suggestion is to parametrize the type of row key in the HBase input
> >> and output operators, and let the user instantiate the required type for
> >> row key. We can have default implementations for String and/or Long. By
> >> parametrizing the row key type, the user can even use complex row keys
> >> which are a combination of multiple fields.
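
One possible shape for such a parametrized row key is sketched below. This is only an illustration under my own assumptions, not the design in the pull request: RowKeyCodec, StringRowKeyCodec, LongRowKeyCodec and KeyTrackingReader are hypothetical names, and the reader class merely stands in for whatever state an input operator would keep about the last key it read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RowKeyCodecDemo {
    // Hypothetical interface: converts a typed row key to and from raw bytes.
    interface RowKeyCodec<K> {
        byte[] toBytes(K key);
        K fromBytes(byte[] bytes);
    }

    // Default implementation for String keys (UTF-8 encoded).
    static class StringRowKeyCodec implements RowKeyCodec<String> {
        public byte[] toBytes(String key) {
            return key.getBytes(StandardCharsets.UTF_8);
        }
        public String fromBytes(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Default implementation for Long keys (8 big-endian bytes).
    static class LongRowKeyCodec implements RowKeyCodec<Long> {
        public byte[] toBytes(Long key) {
            return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
        }
        public Long fromBytes(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getLong();
        }
    }

    // Hypothetical stand-in for operator state: remembers the last key read.
    static class KeyTrackingReader<K> {
        private final RowKeyCodec<K> codec;
        private K lastReadKey;

        KeyTrackingReader(RowKeyCodec<K> codec) {
            this.codec = codec;
        }
        void onRowRead(byte[] rawKey) {
            lastReadKey = codec.fromBytes(rawKey);
        }
        K lastReadKey() {
            return lastReadKey;
        }
    }

    public static void main(String[] args) {
        LongRowKeyCodec codec = new LongRowKeyCodec();
        KeyTrackingReader<Long> reader = new KeyTrackingReader<>(codec);
        reader.onRowRead(codec.toBytes(8888L));
        System.out.println("last key read: " + reader.lastReadKey());
    }
}
```

A composite key made of multiple fields would then be just another RowKeyCodec implementation supplied by the user.
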
> >>
> >> Thoughts?
> >>
> >> PS: I understand that there is a performance concern in using a
> >> monotonically increasing key as the row key. Given that, how do we
> >> address the incremental read scenario?
> >>
> >> Thanks
> >>
> >> -Bhupesh
> >>
> >> On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <
> >> [email protected]> wrote:
> >>
> >>> Looks fine to me.
> >>>
> >>> Regards,
> >>> Sandeep
> >>>
> >>> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <
> [email protected]
> >>> >
> >>> wrote:
> >>>
> >>> > Here is the final hierarchy I am considering:
> >>> >
> >>> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got
> >>> > rid of HBaseOperatorBase.
> >>> >     HBaseScanOperator - Takes care of scanning the table in a
> >>> >     non-blocking manner. Exposes operationScan() and getTuple() as before.
> >>> >         HBasePOJOInputOperator - Implements operationScan() and
> >>> >         getTuple() and outputs a POJO on the output port.
> >>> >
> >>> > Comments?
> >>> >
> >>> > -Bhupesh
> >>> >
> >>> >
> >>> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <
> >>> [email protected]>
> >>> > wrote:
> >>> >
> >>> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems
> >>> > > to have all the functionality provided by HBaseInputOperator and even
> >>> > > more (including Kerberos authentication).
> >>> > >
> >>> > > It would be a good idea to avoid the usage of HBaseInputOperator going
> >>> > > forward and use HBaseStore instead.
> >>> > >
> >>> > > I will also work on abstracting out the HBase input functionality in
> >>> > > the HBaseInputOperator, which can be extended by concrete
> >>> > > implementations.
> >>> > >
> >>> > > -Bhupesh
> >>> > >
> >>> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <
> >>> [email protected]
> >>> > >
> >>> > > wrote:
> >>> > >
> >>> > >> Thanks for the inputs.
> >>> > >> As an input operator, I am targeting just the Scan operation. The Get
> >>> > >> operation may be better supported as a generic operator (like a query
> >>> > >> operator), which I can take up later.
> >>> > >>
> >>> > >> -Bhupesh
> >>> > >>
> >>> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <
> >>> [email protected]>
> >>> > >> wrote:
> >>> > >>
> >>> > >>> +1
> >>> > >>>
> >>> > >>> Regards,
> >>> > >>> Mohit
> >>> > >>>
> >>> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar <
> >>> > >>> [email protected]
> >>> > >>> > wrote:
> >>> > >>>
> >>> > >>> > +1 for above.
> >>> > >>> > I see that there is HbaseGetOperator, but it is abstract and I
> >>> > >>> > cannot find a concrete implementation of it.
> >>> > >>> > Are you going to implement that too?
> >>> > >>> >
> >>> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
> >>> > >>> > this.
> >>> > >>> >
> >>> > >>> > Also, I want to mention one thing about scan from my previous
> >>> > >>> > experience with HBase. The HBase client is synchronous.
> >>> > >>> > This means that when you fire a scan call, the function blocks
> >>> > >>> > until a certain number of records are received at the client end.
> >>> > >>> > This causes a lot of problems in the current thread, as it might
> >>> > >>> > get blocked for a long period of time.
> >>> > >>> > Plus, there is always network-related latency to add to the
> >>> > >>> > problem.
> >>> > >>> >
> >>> > >>> > Usually the way to deal with this is to fire scan-like queries on
> >>> > >>> > a separate thread and then consume the results in the main thread.
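
The pattern described above (run the blocking scan on a separate thread, consume results in the main thread) can be sketched roughly as follows. This is a simplified illustration under my own assumptions and not the operator code: blockingFetchBatch() is a hypothetical stand-in for a blocking HBase client call, and the polling loop only mimics what an input operator might do when asked to emit tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class BackgroundScanDemo {
    // Hypothetical stand-in for a blocking client call that returns one batch of rows.
    static List<String> blockingFetchBatch(int batch) throws InterruptedException {
        Thread.sleep(50); // simulate network and server latency
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            rows.add("row-" + batch + "-" + i);
        }
        return rows;
    }

    // Runs the scan on a background thread and drains results on the calling thread.
    static List<String> scanInBackground(final int batches) {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        final AtomicBoolean scanDone = new AtomicBoolean(false);
        ExecutorService scanner = Executors.newSingleThreadExecutor();
        scanner.execute(() -> {
            try {
                for (int b = 0; b < batches; b++) {
                    for (String row : blockingFetchBatch(b)) {
                        queue.put(row); // waits if the bounded queue is full
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                scanDone.set(true);
            }
        });
        scanner.shutdown(); // accept no new tasks; the running scan still completes

        List<String> emitted = new ArrayList<>();
        try {
            while (true) {
                // Short timed poll: the consuming thread never waits on the scan itself.
                String row = queue.poll(10, TimeUnit.MILLISECONDS);
                if (row != null) {
                    emitted.add(row);
                } else if (scanDone.get() && queue.isEmpty()) {
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(scanInBackground(2));
    }
}
```

The bounded queue also gives some back-pressure: if the consumer falls behind, the scanning thread waits on put() instead of buffering the whole table in memory.
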
> >>> > >>> >
> >>> > >>> > Please take care of this scenario while implementing the scan
> >>> > >>> > operator.
> >>> > >>> >
> >>> > >>> > -Chinmay.
> >>> > >>> >
> >>> > >>> >
> >>> > >>> > ~ Chinmay.
> >>> > >>> >
> >>> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh <
> >>> > >>> > [email protected]>
> >>> > >>> > wrote:
> >>> > >>> >
> >>> > >>> > > +1 for this Bhupesh.
> >>> > >>> > >
> >>> > >>> > > Additionally, I would suggest adding support for:
> >>> > >>> > > 1. Point query
> >>> > >>> > > 2. Returning any row version
> >>> > >>> > >
> >>> > >>> > > The above two are key features of HBase and should be supported.
> >>> > >>> > >
> >>> > >>> > > Regards,
> >>> > >>> > > Sandeep
> >>> > >>> > >
> >>> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda <
> >>> > >>> [email protected]
> >>> > >>> > >
> >>> > >>> > > wrote:
> >>> > >>> > >
> >>> > >>> > > > Hi All,
> >>> > >>> > > >
> >>> > >>> > > > The current HBasePOJOInputOperator has the following limitations:
> >>> > >>> > > >
> >>> > >>> > > >    1. We cannot specify a set of "column family: column" pairs and
> >>> > >>> > > >    fetch data only for these columns.
> >>> > >>> > > >    2. The output format is currently a POJO. We need other output
> >>> > >>> > > >    formats that support the "columnFamily:column" representation.
> >>> > >>> > > >    Map / CSV are some of the options.
> >>> > >>> > > >    3. We cannot specify an "end row-key" to stop scanning a table.
> >>> > >>> > > >    4. There are no metrics.
> >>> > >>> > > >
> >>> > >>> > > > I am planning to add the above functionality to the HBase Input
> >>> > >>> > > > operators. These features may go into the HBaseScanOperator /
> >>> > >>> > > > HBasePOJOInputOperator.
> >>> > >>> > > >
> >>> > >>> > > > Please let me know your comments.
> >>> > >>> > > >
> >>> > >>> > > > Thanks.
> >>> > >>> > > >
> >>> > >>> > > > Bhupesh
> >>> > >>> > > >
> >>> > >>> > >
> >>> > >>> >
> >>> > >>>
> >>> > >>
> >>> > >>
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>
