Dear Community,

Can anyone help review the pull request:
https://github.com/apache/incubator-apex-malhar/pull/212

Thanks.
~Bhupesh

On Thu, Mar 17, 2016 at 4:16 PM, Bhupesh Chawda <[email protected]> wrote:

> Hi,
>
> I have opened a pull request for the changes described in the previous
> emails. Here is the pull request:
> https://github.com/apache/incubator-apex-malhar/pull/212
>
> Here is a short description of the changes:
>
> HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
> of HBaseOperatorBase.
> HBaseScanOperator - Takes care of scanning the table in a non-blocking
> manner. Exposes operationScan() and getTuple() as before.
> HBasePOJOInputOperator - Implements operationScan() and getTuple() and
> outputs a POJO on the output port.
>
> Please help review these changes.
>
> Thanks
> ~Bhupesh
>
> On Fri, Mar 11, 2016 at 4:42 PM, Bhupesh Chawda <[email protected]>
> wrote:
>
>> Hi All,
>>
>> In the current design of the HBase input and output operators, the row
>> key is hard-coded to be of String type.
>> I foresee the following issue:
>>
>> - In the case of numeric keys which are cast to String, *incremental
>> read* is problematic. For example, after reading key = 9, we may not
>> be able to read any record with, say, key = 8888, because even though
>> numerically 8888 > 9, lexicographically "9" > "8888".
>> - This is the case only when data is being written to HBase and being
>> read from it simultaneously.
>>
>> My suggestion is to parametrize the type of the row key in the HBase
>> input and output operators, and let the user instantiate the required
>> type for the row key. We can have default implementations for String
>> and/or Long. By parametrizing the row key type, the user can even use
>> complex row keys which are a combination of multiple fields.
>>
>> Thoughts?
>>
>> PS: I understand that there is a performance concern in making a
>> monotonically increasing key the row key. Given that, how do we address
>> the incremental read scenario?
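To make the ordering issue above concrete, here is a minimal sketch in plain Java with no HBase dependency. It relies on the fact that HBase sorts row keys as unsigned byte arrays. The class and method names (RowKeyOrderingSketch, asStringKey, asFixedWidthKey, compareRowKeys) are hypothetical and are not part of Malhar or the pull request. The fixed-width encoding is only one possible choice for a Long row key, and it preserves numeric order only for non-negative values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Malhar or HBase.
final class RowKeyOrderingSketch {

    // Numeric key written as its decimal string, as with a String-typed row key.
    static byte[] asStringKey(long key) {
        return Long.toString(key).getBytes(StandardCharsets.UTF_8);
    }

    // Numeric key written as 8 big-endian bytes (ByteBuffer's default order).
    // For non-negative values, byte order then matches numeric order.
    static byte[] asFixedWidthKey(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    // Unsigned lexicographic comparison of two byte arrays,
    // which is how row keys are ordered within a table.
    static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // String keys: "9" sorts after "8888", so a scan resuming after 9 skips 8888.
        System.out.println(compareRowKeys(asStringKey(9), asStringKey(8888)) > 0);
        // Fixed-width keys: 9 sorts before 8888, so the resumed scan reaches it.
        System.out.println(compareRowKeys(asFixedWidthKey(9), asFixedWidthKey(8888)) < 0);
    }
}
```

With a parametrized row key type, the user would choose between encodings like these. Note that the fixed-width form is monotonically increasing, which is the hot-spotting concern raised in the PS.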
>>
>> Thanks
>>
>> -Bhupesh
>>
>> On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <[email protected]>
>> wrote:
>>
>>> Looks fine to me.
>>>
>>> Regards,
>>> Sandeep
>>>
>>> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <[email protected]>
>>> wrote:
>>>
>>> > Here is the final hierarchy I am considering:
>>> >
>>> > HBaseInputOperator - Takes care of HBaseStore and its connection.
>>> > Got rid of HBaseOperatorBase.
>>> > HBaseScanOperator - Takes care of scanning the table in a
>>> > non-blocking manner. Exposes operationScan() and getTuple() as before.
>>> > HBasePOJOInputOperator - Implements operationScan() and getTuple()
>>> > and outputs a POJO on the output port.
>>> >
>>> > Comments?
>>> >
>>> > -Bhupesh
>>> >
>>> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <[email protected]>
>>> > wrote:
>>> >
>>> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems
>>> > > to have all the functionality provided by HBaseInputOperator and even
>>> > > more (including Kerberos authentication).
>>> > >
>>> > > It would be a good idea to avoid the usage of HBaseInputOperator
>>> > > going forward and use HBaseStore instead.
>>> > >
>>> > > I will also work on abstracting out the HBase input functionality in
>>> > > the HBaseInputOperator, which can be extended by concrete
>>> > > implementations.
>>> > >
>>> > > -Bhupesh
>>> > >
>>> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <[email protected]>
>>> > > wrote:
>>> > >
>>> > >> Thanks for the inputs.
>>> > >> As an input operator, I am targeting just the Scan operation. The Get
>>> > >> operation may be better supported as a generic operator (like a query
>>> > >> operator), which I can take up later.
>>> > >>
>>> > >> -Bhupesh
>>> > >>
>>> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <[email protected]>
>>> > >> wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> Regards,
>>> > >>> Mohit
>>> > >>>
>>> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar
>>> > >>> <[email protected]> wrote:
>>> > >>>
>>> > >>> > +1 for above.
>>> > >>> > I see that there is an HbaseGetOperator, but it is abstract and I
>>> > >>> > cannot find a concrete implementation of it.
>>> > >>> > Are you going to implement that too?
>>> > >>> >
>>> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
>>> > >>> > this.
>>> > >>> >
>>> > >>> > Also, I want to mention one thing about scan from my previous
>>> > >>> > experience with HBase. The HBase client is synchronous.
>>> > >>> > This means that when you fire a scan call, the function blocks
>>> > >>> > until a certain number of records are received at the client end.
>>> > >>> > This causes a lot of problems in the current thread, as it might
>>> > >>> > get blocked for a long period of time.
>>> > >>> > Plus, there is always network-related latency to add to the
>>> > >>> > problem.
>>> > >>> >
>>> > >>> > Usually the way to deal with this is to fire scan-like queries on
>>> > >>> > a separate thread and then consume the results in the main thread.
>>> > >>> >
>>> > >>> > Please take care of this scenario while implementing the scan
>>> > >>> > operator.
>>> > >>> >
>>> > >>> > -Chinmay.
>>> > >>> >
>>> > >>> >
>>> > >>> > ~ Chinmay.
>>> > >>> >
>>> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh
>>> > >>> > <[email protected]> wrote:
>>> > >>> >
>>> > >>> > > +1 for this Bhupesh.
>>> > >>> > >
>>> > >>> > > Additionally, I would suggest adding support for:
>>> > >>> > > 1. Point query
>>> > >>> > > 2. Returning any row version
>>> > >>> > >
>>> > >>> > > The above two are key features of HBase and should be
>>> > >>> > > supported.
>>> > >>> > >
>>> > >>> > > Regards,
>>> > >>> > > Sandeep
>>> > >>> > >
>>> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda
>>> > >>> > > <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > > > Hi All,
>>> > >>> > > >
>>> > >>> > > > The current HBasePOJOInputOperator has the following
>>> > >>> > > > limitations:
>>> > >>> > > >
>>> > >>> > > > 1. There is no way to specify a set of "column family:
>>> > >>> > > > column" pairs and fetch data only for these columns.
>>> > >>> > > > 2. The output format is currently a POJO. We need other
>>> > >>> > > > output formats such that the "columnFamily:column"
>>> > >>> > > > representation is supported. Map / CSV are some of the
>>> > >>> > > > options.
>>> > >>> > > > 3. There is no way to specify an "end row-key" to stop
>>> > >>> > > > scanning a table.
>>> > >>> > > > 4. There are no metrics.
>>> > >>> > > >
>>> > >>> > > > I am planning to add the above functionality to the HBase
>>> > >>> > > > input operators.
>>> > >>> > > > These features may go into the HBaseScanOperator /
>>> > >>> > > > HBasePOJOInputOperator.
>>> > >>> > > >
>>> > >>> > > > Please let me know your comments.
>>> > >>> > > >
>>> > >>> > > > Thanks.
>>> > >>> > > >
>>> > >>> > > > Bhupesh
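
The suggestion in this thread to fire scans on a separate thread and consume the results in the main thread can be sketched roughly as below. This is a generic producer/consumer illustration in plain Java, not the code in the pull request. BackgroundScanSketch and its methods are hypothetical names, and the Iterator stands in for whatever blocking source (for example, a scanner over a table) the real operator would wrap. In an input operator, poll() would be the call made from the tuple-emitting method so that the operator thread never waits on the network.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration; not part of Malhar or HBase.
final class BackgroundScanSketch<T> {
    private final BlockingQueue<T> buffer;
    private final Thread worker;
    private volatile boolean done = false;

    // blockingSource stands in for a scanner whose next() may block.
    BackgroundScanSketch(Iterator<T> blockingSource, int capacity) {
        this.buffer = new LinkedBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                while (blockingSource.hasNext()) {
                    // put() blocks the worker (not the caller) when the buffer is full.
                    buffer.put(blockingSource.next());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done = true;
            }
        }, "scan-worker");
        this.worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    // Never blocks: returns the next buffered row, or null if none is ready yet.
    T poll() {
        return buffer.poll();
    }

    // True once the worker has stopped and every buffered row has been consumed.
    boolean isFinished() {
        return done && buffer.isEmpty();
    }

    void stop() {
        worker.interrupt();
    }

    // Example consumer loop: collects rows until the scan is finished.
    static <T> List<T> drain(BackgroundScanSketch<T> scan) {
        List<T> rows = new ArrayList<>();
        while (!scan.isFinished()) {
            T row = scan.poll();
            if (row != null) {
                rows.add(row);
            } else {
                try {
                    Thread.sleep(1); // nothing ready; a real operator would simply return
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return rows;
    }
}
```

The bounded queue is a deliberate choice in this sketch: it keeps a fast scan from filling memory when the consumer is slow, at the cost of pausing the worker thread.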
