[
https://issues.apache.org/jira/browse/SAMZA-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jake Maes updated SAMZA-1392:
-----------------------------
Attachment: Producer Performance Tests for SAMZA-1392 - Sheet1.pdf
Attaching some performance test results for the patch.
The results look pretty good across the board. As expected, the biggest
improvement was with compression and multithreading enabled, but it's slightly
faster all around. The fix to the flush logic (first bullet point in the list
below) should have slowed us down a bit, so that must have been offset by the
other improvements in the change.
I’m a little snowblind from looking at the spreadsheet for the past few days,
so if you spot any anomalies, let me know. :-)
Background:
In short, we wanted to remove the sendLock that we use to get the latestFuture
because the KafkaProducer does compression and serialization in the user
thread, which causes contention on that lock when multithreading is enabled.
Recall that we were tracking the latestFuture and waiting on it for flush().
That code was written at a time before Kafka's flush() was available/working.
Since the Kafka flush() operation is now available in all the versions
supported by Samza, the plan was to switch to that API. With that model, we can
flush without tracking the latestFuture which means we can remove the sendLock.
Along the way, I found and fixed a couple data loss issues which mostly stem
from our synchronous SystemProducer API trying to handle asynchronous error
scenarios with multithreading enabled.
Summary of the improvements:
* Fixed a bug in flush() logic that waits on the latest Future per task rather
than per partition, which could lead to data loss.
* Fixed a bug with exception handling when task.async.commit=true that could
cause data loss.
* Significantly improved send() performance with multithreading by removing the
sendLock in favor of the KafkaProducer flush() method. (see performance results
in google sheet)
* Improved availability guarantees by introducing task.drop.producer.errors,
which guarantees that a container will never fail due to async producer errors.
This fixes the case where the application wants to swallow producer exceptions,
but can't do so when they occur in flush(), so they still get occasional
container failures.
* Code is much more simple now and therefore easier to maintain.
* More thorough unit tests added.
* Potentially better batching in low throughput scenarios now because
linger.ms=10 by default.
> KafkaSystemProducer performance and correctness with concurrent sends and
> flushes
> ---------------------------------------------------------------------------------
>
> Key: SAMZA-1392
> URL: https://issues.apache.org/jira/browse/SAMZA-1392
> Project: Samza
> Issue Type: Bug
> Reporter: Jake Maes
> Assignee: Jake Maes
> Fix For: 0.14.0
>
> Attachments: Producer Performance Tests for SAMZA-1392 - Sheet1.pdf
>
>
> There are 2 issues we need to fix in the KafkaSystemProducer when sends and
> flushes are called concurrently:
> 1. Concurrent sends contend for the sendlock, especially when producer
> compression is enabled. The fix is to use the producer.flush() API, which
> kafka has supported since at least version 0.9.x. This way we won't need to
> track the latest future, so we won't need the lock.
> 2. When task.async.commit is enabled, the threads calling send() could set
> the exceptionInCallback to null before the exception is handled in user code
> or flush(). This could allow us to checkpoint offsets for which the
> corresponding output was not successfully sent.
> The short term solution here is to only handle the callback exceptions from
> flush() and allow users to configure the exceptions as ignorable in case they
> don't want flush to fail.
> The long term solution is to support a fully asynchronous SystemProducer.
> Ticket SAMZA-1393.
> I found issue #2 while working on issue #1, so while they're separate issues,
> it's easier to fix them with one ticket/patch.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)