I'm not sure either. I agree that 35 is too high; the 2.x+ default of 15 seems reasonable to me.

We might be able to include a connection check API in the future, but that won't help us now, so we need to find some way forward. I believe we have different "services" in NiFi for HBase 1 & 2. How about we "revert" the NiFi HBase 2 service to use the HBase default, and override the HBase 1 service so that it also defaults to 15?
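For illustration, the change would mostly come down to the default value of the retries property that both client services already expose. This is a rough sketch only -- the property name, description, and validator below are approximations, not the actual descriptor in HBase_2_ClientService:

```
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

public class HBaseClientRetriesDefault {

    // Sketch of the retries property (in spirit) for both client services:
    // default to 15 -- the HBase 2.x client default -- instead of 1 for the
    // HBase 2 service, and instead of the 1.x default of 35 for the HBase 1 service.
    static final PropertyDescriptor HBASE_CLIENT_RETRIES = new PropertyDescriptor.Builder()
            .name("HBase Client Retries")
            .description("Number of times the HBase client will internally retry a failed "
                    + "operation (hbase.client.retries.number) before reporting a failure.")
            .defaultValue("15")
            .required(true)
            .addValidator(StandardValidators.POSITIVE_INTEGER_VALIDATOR)
            .build();
}
```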
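And if we ever do add an explicit connection check, Josh's two-connection idea further down the thread might be enough: probe the cluster once with a throwaway, low-retry connection when the service is enabled, then build the long-lived connection with the normal retry settings. Untested sketch against the plain HBase client API -- the probe call and the specific timeout/retry values are placeholders, not anything the service does today:

```
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectionSanityCheck {

    /**
     * Fail fast on bad configuration by probing the cluster with a throwaway,
     * low-retry connection, then create the long-lived connection that the
     * service actually hands out, with normal retry settings.
     */
    public static Connection createVerifiedConnection(Configuration base) throws IOException {
        // 1. Throwaway probe connection: one retry, short timeouts.
        Configuration probeConf = new Configuration(base);
        probeConf.setInt("hbase.client.retries.number", 1);
        probeConf.setInt("hbase.rpc.timeout", 5000);
        probeConf.setInt("hbase.client.operation.timeout", 10000);
        try (Connection probe = ConnectionFactory.createConnection(probeConf);
             Admin admin = probe.getAdmin()) {
            // Any cheap RPC works as the sanity check; checking for hbase:meta
            // forces a real round trip to the cluster.
            admin.tableExists(TableName.META_TABLE_NAME);
        }

        // 2. Real connection: let the HBase client ride over transient errors
        //    (region moves, brief RegionServer restarts) on its own.
        Configuration realConf = new Configuration(base);
        realConf.setInt("hbase.client.retries.number", 15);
        return ConnectionFactory.createConnection(realConf);
    }
}
```

That would keep the fail-fast behaviour at enable time without forcing every flowfile through NiFi-level retries for errors the HBase client can handle on its own.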

On Tue, Feb 25, 2020 at 9:44 PM Josh Elser <[email protected]> wrote:

> So, do we have a real issue here? Or are we just overlapping two discussions about the same code-change with different context?
>
> My impression from Bryan is, now that he has the bigger context of why implicit retries are important for HBase, that he is not against setting this value higher (although, he also originally said this [1]), but just not the 1.x default of 35 retries (which *I* will say is too high). Am I reading too much into what you've said, Bryan?
>
> I understand that folks are busy and this is probably not the highest priority. I'm just trying to figure out if we're actually still in disagreement. My gut says that we're not.
>
> Re-stating goals: I think everyone is trying to make sure that the user of NiFi with HBase in the mix has a great user experience OOTB. That requires finding a balance between "fail fast when something is outwardly messed up" and "don't fail when HBase can implicitly ride over a transient/automatically-recoverable state".
>
> [1] https://github.com/apache/nifi/pull/3425#issuecomment-487962996
>
> On 2/19/20 11:14 AM, Lars Francke wrote:
> > That's what I suggested[1] and Bryan rejected[2]
> >
> > [1] <https://github.com/apache/nifi/pull/3425#issuecomment-482711295>
> > [2] <https://github.com/apache/nifi/pull/3425#issuecomment-491984116>
> >
> > On Wed, Feb 19, 2020 at 4:29 PM Josh Elser <[email protected]> wrote:
> >
> >> Thanks for sharing the code, Bryan! I lazily did not go digging.
> >>
> >> Looking into #onEnabled, we could change this to create its own Connection with a very low 1 or 2 retries and a very short retry duration. Throw that one away after we did our sanity check, set a high retry amount, and then create a new Connection for all of the puts/gets/scans/whatever the service will do.
> >>
> >> I think NiFi, in general, can err on the side of "lower" client retries (both in number and duration of backoff) than a normal client, since it can implicitly buffer+retry flowfiles that fail.
> >>
> >> WDYT, Lars?
> >>
> >> On 2/19/20 9:24 AM, Bryan Bende wrote:
> >>> We do already expose the ability to configure the retries in the controller service [1], it was just a debate about what the default value should be. Currently it is set to 1 because we believed it was better to fail fast during the initial configuration of the service. A user can easily set it back to 15, or whatever the HBase client normally does. I even suggested a compromise of making the default value 7, which was halfway between the two extremes, but that was considered unacceptable, even though based on what you said in your original message, most cases work on the first retry, so 7 would have covered those.
> >>>
> >>> We can also expose a property for the RPC timeout, but I think the retries is more the issue here.
> >>>
> >>> We can also definitely improve the docs, but I would still lean towards the default retries being something lower, and then the docs can explain that, after ensuring the service can be successfully started, this value can be increased to 15.
> >>>
> >>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-services/nifi-hbase_2-client-service-bundle/nifi-hbase_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_2_ClientService.java#L134-L140
> >>>
> >>> On Tue, Feb 18, 2020 at 8:31 PM Josh Elser <[email protected]> wrote:
> >>>>
> >>>> We could certainly implement some kind of "sanity check" via HBase code, but I think the thing that is missing is the logical way to validate this in NiFi itself.
> >>>>
> >>>> Something like...
> >>>>
> >>>> ```
> >>>> Configuration conf = HBaseConfiguration.create();
> >>>> conf.setInt("hbase.rpc.timeout", 5000);
> >>>> try (Connection conn = ConnectionFactory.createConnection(conf)) {
> >>>>   // do sanity check
> >>>> }
> >>>> Configuration conf2 = new Configuration(conf);
> >>>> conf2.setInt("hbase.rpc.timeout", 25000);
> >>>> try (Connection conn = ConnectionFactory.createConnection(conf2)) {
> >>>>   // do real hbase-y stuff
> >>>> }
> >>>> ```
> >>>>
> >>>> Maybe instead of requiring an implicit way to do this (requiring NiFi code changes), we could solve the problem at the "human level": create docs that walk people through how to push a dummy record through the service/processor with the low configuration of timeouts and retries? Make the "sanity check" a human operation and just expose the ability to set timeout/retries via the controller service?
> >>>>
> >>>> On 2/18/20 4:36 PM, Lars Francke wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Josh, thanks for bringing it up here again.
> >>>>> I'm happy to revive the PR with whatever the outcome of this thread is.
> >>>>> It came up today because another client complained about how "unstable" HBase is on NiFi.
> >>>>>
> >>>>> @Josh: As the whole issue is only the initial connect, can we have a different timeout setting there? I have to admit I don't know.
> >>>>>
> >>>>> Cheers,
> >>>>> Lars
> >>>>>
> >>>>> On Tue, Feb 18, 2020 at 8:11 PM Pierre Villard <[email protected]> wrote:
> >>>>>
> >>>>>> Good point, I don't think we can do that on a controller service.
> >>>>>>
> >>>>>> On Tue, Feb 18, 2020 at 11:06, Bryan Bende <[email protected]> wrote:
> >>>>>>
> >>>>>>> That could make it a little better, but I can't remember, can we terminate on a controller service?
> >>>>>>>
> >>>>>>> The issue here would be on first-time enabling of the HBase client service, so before even getting to a processor.
> >>>>>>>
> >>>>>>> On Tue, Feb 18, 2020 at 2:00 PM Pierre Villard <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Bryan,
> >>>>>>>>
> >>>>>>>> I didn't follow the whole discussion so I apologize if I'm saying something stupid here. Now that we have the possibility to terminate threads in a processor, would that solve the issue?
> >>>>>>>>
> >>>>>>>> Pierre
> >>>>>>>>
> >>>>>>>> On Tue, Feb 18, 2020 at 10:52, Bryan Bende <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Josh,
> >>>>>>>>>
> >>>>>>>>> The problem isn't so much about the retries within the flow, it's more about setting up the service for the first time.
> >>>>>>>>>
> >>>>>>>>> A common scenario for users was the following:
> >>>>>>>>>
> >>>>>>>>> - Create a new HBase client service
> >>>>>>>>> - Enter some config that wasn't quite correct, possibly hostnames that weren't reachable from NiFi, as one example
> >>>>>>>>> - Enable service and enter retry loop
> >>>>>>>>> - Attempt to disable service to fix config, but have to wait 5+ mins for the retries to finish
> >>>>>>>>>
> >>>>>>>>> Maybe a lazy initialization of the connection on our side would help here, although it would just be moving the problem until later (i.e. the service immediately enables because nothing is happening, then they find out about config problems later when a flow file hits an HBase processor).
> >>>>>>>>>
> >>>>>>>>> I guess the ideal scenario would be to have different logic for initializing the connection vs. using it, so that there wouldn't be retries during initialization.
> >>>>>>>>>
> >>>>>>>>> -Bryan
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 18, 2020 at 1:21 PM Josh Elser <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hiya!
> >>>>>>>>>>
> >>>>>>>>>> LarsF brought this up in the apache-hbase Slack account and it caught my eye. Sending a note here since the PR is closed where this was being discussed before [1].
> >>>>>>>>>>
> >>>>>>>>>> I understand Bryan's concerns that misconfiguration of an HBase processor with a high number of retries and back-off can create a situation in which the processing of a single FlowFile will take a very long time to hit the onFailure state.
> >>>>>>>>>>
> >>>>>>>>>> However, as an HBase developer, I can confidently state that hbase.client.retries=1 will create scenarios in which you'll be pushing a FlowFile through a retry loop inside of NiFi for things which should be implicitly retried inside of the HBase client.
> >>>>>>>>>>
> >>>>>>>>>> For example, if a Region is being moved between two RegionServers and an HBase processor is trying to read/write to that Region, the client will see an exception. This is a "retriable" exception in HBase parlance, which means that the HBase client code would automatically re-process that request (looking for the new location of that Region first). In most cases, the subsequent RPC would succeed, the caller is none the wiser, and the whole retry logic took only a few milliseconds.
> >>>>>>>>>>
> >>>>>>>>>> My first idea was also what Lars had suggested -- can we come up with a sanity check to validate "correct" configuration for the processor before we throw the waterfall of data at it? I can respect it if processors don't have a "good" hook to do such a check.
> >>>>>>>>>>
> >>>>>>>>>> What _would_ be the ideal semantics from NiFi's perspective? We have the ability to implicitly retry operations and also control the retry backoff values. Is there something more we could do from the HBase side, given what y'all have seen from the battlefield?
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>> - Josh
> >>>>>>>>>>
> >>>>>>>>>> [1] https://github.com/apache/nifi/pull/3425
