2019-07-26 16:28:59 UTC - alphazero: Hi team. I have a general question regarding the quality of service of `sink` `connectors`, specifically in the context of (remote) host failures, back pressure, retries, etc. Hugely appreciate insights and any relevant links.
----
2019-07-26 16:33:39 UTC - David Kjerrumgaard: @alphazero It depends a bit on how much data buffering is done inside the sink itself, e.g. whether you batch up messages to do a bulk insert, etc. However, in general, if the sink fails the messages, they will be retained in the source topic, so no data will be lost. Once the sink's subscription has built up a sufficient backlog, the sink will stop consuming messages until the underlying issue is resolved.
----
2019-07-26 16:36:50 UTC - alphazero: thanks @David Kjerrumgaard. background: We're in the initial exploratory phase so open to `best practices` suggestions. Our potential end-points are both `http`-based connection-less ones and standard `protocol` ones, e.g. `amqp`. We were planning on using the built-in `sinks` for the standard end-points.
----
2019-07-26 16:38:30 UTC - alphazero: The actual remote system sinks are 3rd party systems not managed by us at all. We control the `source` (which happens to be `RabbitMQ`).
----
2019-07-26 16:39:23 UTC - alphazero: So our setup (pending full migration to Pulsar if it shines as we expect it to) is `amqp` -> `pulsar-connectors` -> `3rd parties`
----
2019-07-26 16:40:15 UTC - alphazero: At some future date we plan on removing the rabbits and using Pulsar only.
----
2019-07-26 16:50:29 UTC - David Kjerrumgaard: @alphazero The standard built-in sinks should perform as outlined above. When interacting with 3rd party systems, system availability is out of your control, so the best you can do is to identify and react to that scenario in a reasonable fashion that ensures you don't lose data. Fortunately, you are using the correct framework for that, as those capabilities are already built into Pulsar.
----
2019-07-26 16:52:04 UTC - David Kjerrumgaard: @alphazero From a best practices perspective, if you are writing your own sink in the future and wish to batch up messages before sending them to the external system, then you will want to ensure that you only ack the messages AFTER you have successfully published them, as indicated by a success response from the downstream system.
----
2019-07-26 16:54:57 UTC - alphazero: thank you @David Kjerrumgaard. so is it correct to assume that intermittent failures (connection drops, etc.) are transparently handled by the built-in connectors and our main responsibility is flow monitoring in case of backlogs?
----
2019-07-26 16:56:40 UTC - David Kjerrumgaard: @alphazero Correct, but if you encounter different behavior please file a JIRA, etc. :smiley:
----
2019-07-26 16:56:52 UTC - alphazero: LOL :slightly_smiling_face: will do.
----
2019-07-26 16:58:17 UTC - alphazero: one last q @David Kjerrumgaard. Is it a bad practice to retain `state` in these connectors?
----
2019-07-26 17:07:00 UTC - David Kjerrumgaard: @alphazero First off, you should definitely NOT retain state inside the connector itself, i.e. as a local or static variable. Connectors are ephemeral for one thing, and there could be multiple instances for another. It would be better to use the state capabilities provided by the Pulsar Functions State API. +1 : alphazero
----
2019-07-26 17:08:13 UTC - alphazero: Understood. And connection life-cycle. It's a bit of a mystery to me how the SDK detects a dropped connection and re-instantiates the connector. Should I assume this is transparently handled by Pulsar?
----
2019-07-26 17:16:34 UTC - David Kjerrumgaard: @alphazero A dropped connection to the external system should be handled in a try/catch block inside the connector itself. In the event of an exception, you can react accordingly, e.g. attempt to re-establish a connection, etc.
However, the most important thing to do is to ensure that the message(s) are `failed` in such a scenario. This ensures they will be retained and replayed if/when the external system comes back online. +1 : alphazero
----
2019-07-26 17:18:45 UTC - alphazero: Thank you @David Kjerrumgaard for all your input. Very helpful. /out
----
2019-07-26 17:19:44 UTC - David Kjerrumgaard: The above scenario will not cause the connector to be stopped/restarted, etc. It is performing properly, just failing incoming messages, and it will continue to do so until the problem is fixed. At some point backlog quotas and message TTL come into play, so you will need to adjust those on the source topic accordingly. +1 : alphazero
----
2019-07-26 17:21:30 UTC - David Kjerrumgaard: for production environments, I also suggest using the long weekend rule, i.e. prepare to handle a scenario where the situation persists for the entire duration of a holiday weekend, when your team is away and no one can address the issue until they return. :smiley:
----
2019-07-26 17:27:19 UTC - alphazero: yep, thanks. The picture is much clearer now. The retention of this backlog is a domain issue that is frankly a can of worms on its own. +1 : David Kjerrumgaard
----
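The ack/fail advice in the thread above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Pulsar's connector API: `Record`, `BatchingSink`, `publish`, and `batch_size` are hypothetical stand-ins, with `ack()` and `fail()` only modeling the "ack after success, fail on exception" behavior David describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    """Hypothetical stand-in for a consumed message; only ack/fail are modeled."""
    value: str
    acked: bool = False
    failed: bool = False

    def ack(self) -> None:
        self.acked = True

    def fail(self) -> None:
        self.failed = True


class BatchingSink:
    """Buffers records and settles them only once the downstream call returns."""

    def __init__(self, publish: Callable[[List[str]], None], batch_size: int = 3):
        self.publish = publish        # hypothetical call into the external system
        self.batch_size = batch_size  # assumed batching threshold
        self.pending: List[Record] = []

    def write(self, record: Record) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch, self.pending = self.pending, []
        if not batch:
            return
        try:
            self.publish([r.value for r in batch])
        except Exception:
            # Downstream is unavailable: fail every record in the batch so it
            # is retained and can be replayed later. The sink itself keeps running.
            for r in batch:
                r.fail()
            return
        # Ack only AFTER the downstream system reported success.
        for r in batch:
            r.ack()
```

Nothing is kept in the sink beyond the in-flight batch, so a restarted or additional instance starts clean; anything that has to survive a restart would belong in an external store or the State API mentioned above, not in a field like `pending`.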
