Hi, I have opened a pull request for the changes as described in the previous emails. Here is the pull request: https://github.com/apache/incubator-apex-malhar/pull/212
Here is a short description of the changes:

HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid of HBaseOperatorBase.
HBaseScanOperator - Takes care of scanning the table in a non-blocking manner. Exposes operationScan() and getTuple() as before.
HBasePOJOInputOperator - Implements operationScan() and getTuple() and outputs a POJO on the output port.

Please help review these changes.

Thanks
~Bhupesh

On Fri, Mar 11, 2016 at 4:42 PM, Bhupesh Chawda <[email protected]> wrote:

> Hi All,
>
> In the current design of HBase input and output operators, the row key is
> hard-coded to be of String type.
> I foresee the following issue:
>
>    - In case of numeric keys which are cast to String, *incremental
>    read* is problematic. For example, after reading key = 9, we may not
>    be able to read any record with, say, key = 8888: even though numerically
>    8888 > 9, lexicographically "9" > "8888".
>    - This is the case only when data is being written to HBase and being
>    read from it simultaneously.
>
> My suggestion is to parametrize the type of row key in the HBase input and
> output operators, and let the user instantiate the required type for row
> key. We can have default implementations for String and/or Long. By
> parametrizing the row key type, the user can even use complex row keys
> which are a combination of multiple fields.
>
> Thoughts?
>
> PS: I understand that there is a performance concern in making a
> monotonically increasing key as the row key. Given that, how do we address
> the incremental read scenario?
>
> Thanks
>
> -Bhupesh
>
> On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <[email protected]>
> wrote:
>
>> Looks fine to me.
>>
>> Regards,
>> Sandeep
>>
>> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <[email protected]>
>> wrote:
>>
>> > Here is the final hierarchy I am considering:
>> >
>> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
>> > of HBaseOperatorBase.
>> > HBaseScanOperator - Takes care of scanning the table in a non-blocking
>> > manner. Exposes operationScan() and getTuple() as before.
>> > HBasePOJOInputOperator - Implements operationScan() and getTuple()
>> > and outputs a POJO on the output port.
>> >
>> > Comments?
>> >
>> > -Bhupesh
>> >
>> >
>> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <[email protected]>
>> > wrote:
>> >
>> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems to
>> > > have all the functionality provided by HBaseInputOperator and even more
>> > > (including Kerberos authentication).
>> > >
>> > > It would be a good idea to avoid the usage of HBaseInputOperator going
>> > > forward and use HBaseStore instead.
>> > >
>> > > I will also work on abstracting out the HBase input functionality in the
>> > > HBaseInputOperator, which can be extended by concrete implementations.
>> > >
>> > > -Bhupesh
>> > >
>> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <[email protected]>
>> > > wrote:
>> > >
>> > >> Thanks for the inputs.
>> > >> As an input operator, I am targeting just the Scan operation. Get
>> > >> operation may be supported better as a generic operator (like a query
>> > >> operator) which I can take up later.
>> > >>
>> > >> -Bhupesh
>> > >>
>> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <[email protected]>
>> > >> wrote:
>> > >>
>> > >>> +1
>> > >>>
>> > >>> Regards,
>> > >>> Mohit
>> > >>>
>> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar <[email protected]>
>> > >>> wrote:
>> > >>>
>> > >>> > +1 for above.
>> > >>> > I see that there is HbaseGetOperator, but it is abstract and I can
>> > >>> > find no concrete implementation of it.
>> > >>> > Are you going to implement that too?
>> > >>> >
>> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
>> > >>> > this.
>> > >>> >
>> > >>> > Also, I want to mention one thing about scan from my previous
>> > >>> > experience of Hbase. The Hbase client is synchronous.
>> > >>> > This means when you fire a scan call, the function blocks until a
>> > >>> > certain number of records are received at the client end.
>> > >>> > This causes a lot of problems in the current thread as it might just
>> > >>> > get blocked for a long period of time.
>> > >>> > Plus, there is always network-related latency to add to the problem.
>> > >>> >
>> > >>> > Usually the way to deal with this is to fire scan-like queries on a
>> > >>> > separate thread and then consume the results in the main thread.
>> > >>> >
>> > >>> > Please take care of this scenario while implementing the scan
>> > >>> > operator.
>> > >>> >
>> > >>> > -Chinmay.
>> > >>> >
>> > >>> >
>> > >>> > ~ Chinmay.
>> > >>> >
>> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh <[email protected]>
>> > >>> > wrote:
>> > >>> >
>> > >>> > > +1 for this Bhupesh.
>> > >>> > >
>> > >>> > > Additionally, I would suggest adding support for:
>> > >>> > > 1. Point query
>> > >>> > > 2. Returning any row version
>> > >>> > >
>> > >>> > > The above two are key features of HBase and should be supported.
>> > >>> > >
>> > >>> > > Regards,
>> > >>> > > Sandeep
>> > >>> > >
>> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda <[email protected]>
>> > >>> > > wrote:
>> > >>> > >
>> > >>> > > > Hi All,
>> > >>> > > >
>> > >>> > > > The current HBasePOJOInputOperator does not allow us to do the
>> > >>> > > > following:
>> > >>> > > >
>> > >>> > > > 1. Allow us to specify a set of "column family: column" and fetch
>> > >>> > > > data only for these columns.
>> > >>> > > > 2. Output format is currently a POJO. We need to have other output
>> > >>> > > > formats such that "columnFamily:column" representation is supported.
>> > >>> > > > Map / CSV are some of the options.
>> > >>> > > > 3. Allow specifying "end row-key" to stop scanning a table.
>> > >>> > > > 4. No metrics.
>> > >>> > > >
>> > >>> > > > I am planning to add the above functionality to the HBase Input
>> > >>> > > > operators.
>> > >>> > > > These features may go into the HBaseScanOperator /
>> > >>> > > > HBasePOJOInputOperator.
>> > >>> > > >
>> > >>> > > > Please let me know your comments.
>> > >>> > > >
>> > >>> > > > Thanks.
>> > >>> > > >
>> > >>> > > > Bhupesh
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>> >
>
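
For illustration of the incremental-read concern in the Mar 11 message quoted above: HBase orders row keys as raw bytes, compared lexicographically. The sketch below is not code from the pull request. The class and method names are hypothetical, and it uses only the JDK rather than the HBase client. It compares the same two numeric keys encoded once as decimal strings and once as 8 big-endian bytes, assuming non-negative keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are hypothetical, not Malhar or HBase APIs.
final class RowKeyOrdering {

    // Numeric key stored as its decimal string, as a String-typed row key would be.
    static byte[] asStringKey(long key) {
        return Long.toString(key).getBytes(StandardCharsets.UTF_8);
    }

    // Numeric key stored as 8 big-endian bytes.
    // Assumption: key >= 0 (negative values would sort after positive ones).
    static byte[] asFixedWidthKey(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    // Unsigned, byte-wise lexicographic comparison of two row keys.
    static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Positive result: "9" sorts after "8888", so a scan resuming after 9 skips 8888.
        System.out.println(compareRowKeys(asStringKey(9), asStringKey(8888)));
        // Negative result: the fixed-width encoding keeps numeric order.
        System.out.println(compareRowKeys(asFixedWidthKey(9), asFixedWidthKey(8888)));
    }
}
```

With a parametrized row key type, a Long-typed default could use an order-preserving encoding of this kind, although it would not address the monotonically increasing key concern raised in the PS.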

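
For illustration of the suggestion in Chinmay's Dec 22 message quoted above (run the blocking scan on a separate thread and consume the results in the main thread), here is a minimal JDK-only sketch. It is not the implementation in the pull request. The class name, the method names, and the use of a plain Iterator to stand in for a blocking HBase scanner are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a hypothetical helper, not a Malhar or HBase class.
final class BackgroundScanSketch<T> {
    private final BlockingQueue<T> buffer;
    private final Thread scanner;
    private volatile boolean done = false;

    // blockingSource stands in for a scanner whose next() may block on the network.
    BackgroundScanSketch(Iterator<T> blockingSource, int capacity) {
        buffer = new LinkedBlockingQueue<>(capacity);
        scanner = new Thread(() -> {
            try {
                while (blockingSource.hasNext()) {
                    // put() blocks only this background thread when the buffer is full.
                    buffer.put(blockingSource.next());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done = true;
            }
        }, "scan-thread");
        scanner.setDaemon(true);
    }

    void start() {
        scanner.start();
    }

    // Called from the operator thread: returns up to max buffered rows without blocking.
    List<T> emitAvailable(int max) {
        List<T> out = new ArrayList<>();
        buffer.drainTo(out, max);
        return out;
    }

    // True once the scan has ended and every buffered row has been drained.
    boolean isFinished() {
        return done && buffer.isEmpty();
    }

    public static void main(String[] args) {
        BackgroundScanSketch<String> scan =
            new BackgroundScanSketch<>(Arrays.asList("row1", "row2", "row3").iterator(), 2);
        scan.start();
        while (!scan.isFinished()) {
            for (String row : scan.emitAvailable(10)) {
                System.out.println(row);
            }
            Thread.yield();
        }
    }
}
```

The bounded queue is a design choice for the sketch: it keeps a slow consumer from letting the scan thread buffer an unbounded number of rows.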