Hey all,

Someone asked why there is code duplication between
org.apache.kafka.common and core. The answer seemed like it might be useful
to others, so I'm including it here:

Originally Kafka was more of a proof of concept and we didn't separate the
clients from the server. LinkedIn was much smaller, Kafka wasn't open
source yet, and keeping those separate always adds a lot of overhead. So we
ended up with just one big jar.

Next thing we know the Kafka jar is embedded everywhere. Lots of fallout
from that:
- It has to be really sensitive to dependencies
- Scala causes all kinds of pain for users. Ironically it causes the most
pain for people using Scala because of compatibility. I think the single
biggest Kafka complaint was the Scala clients and the resulting scary
exceptions, lack of javadoc, etc.
- Many of the client interfaces weren't well thought out as permanent
long-term commitments.
- We knew we had to rewrite both clients due to technical deficiencies
anyway. The clients really needed to move to non-blocking I/O, which is
basically a rewrite on its own.

So how to go about that?

Well we felt we needed to maintain the old client interfaces for a good
period of time. Any kind of breaking cut-over was kind of a non-starter.
But a major refactoring in place was really hard since so many classes were
public and so little attention had been paid to the difference between
public and private classes.

Naturally since the client and server do the inverse of each other there is
a ton of shared logic. So we thought we needed to break it up into three
independent chunks:
1. common - shared helper code used by both clients and server
2. clients - the producer, consumer, and eventually admin Java interfaces.
This depends on common.
3. server - the server (and legacy clients). This is currently called core.
This will depend on common and clients (because sometimes the server needs
to make client requests).

Common and clients were left as a single jar and just logically separate so
that people wouldn't have to deal with two jars (and hence the possibility
of getting different versions of each).

The dependency is actually a little counter-intuitive to people--they
usually think of the client as depending on the server since the client
calls the server. But in terms of code dependencies it is the other way--if
you depend on the client you obviously don't want to drag in the server.
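
To make the direction of the dependencies concrete, here is a rough sketch
(not anything from the actual build). The class and method names are made up
for illustration; only the three module names and their direct dependencies
come from the list above. It just computes what each module drags in
transitively:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: models the three chunks described above.
public class ModuleDeps {
    // Direct dependencies: module -> modules it depends on.
    static final Map<String, List<String>> DIRECT = Map.of(
        "common", List.of(),
        "clients", List.of("common"),
        "core", List.of("common", "clients"));

    // Everything a module pulls in, directly or indirectly.
    static Set<String> transitiveDeps(String module) {
        Set<String> seen = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>(DIRECT.get(module));
        while (!todo.isEmpty()) {
            String next = todo.pop();
            if (seen.add(next)) {
                todo.addAll(DIRECT.get(next));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String m : List.of("common", "clients", "core")) {
            System.out.println(m + " -> " + transitiveDeps(m));
        }
    }
}
```

The point is that "core" never shows up in what "clients" pulls in, while
"core" is free to pull in both of the others.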

So to get all this done we decided to just go big and do a rewrite of the
clients in Java. A result of this is that any shared code would have to
move to Java (so the clients don't pull in Scala). We felt this was
probably a good thing in its own right as it gave a chance to improve a few
of these utility libraries like config parsing, etc.

So the plan was and is:
1. Rewrite producer, release and roll out
2a. Rewrite consumer, release and roll out
2b. Migrate server from Scala code to org.apache.kafka.common classes
3. Deprecate Scala clients

(2a) is in flight now, and that means (2b) is totally up for grabs. Of the
pieces of (2b), the request conversion is definitely the most pressing since
having the requests defined twice duplicates a ton of work. We will have to be
hyper-conscientious during the conversion about making the shared code in
common really solve the problem well and conveniently on the server as well
(so we don't end up just shoe-horning it in). My hope is that we can treat
this common code really well--it isn't as permanent as the public classes
but ends up heavily used so we should take good care of it. Most of the shared
code is private so we can refactor the stuff in common to meet the needs of
the server if we find mismatches or missing functionality. I tried to keep
in mind the eventual server usage while writing it, but I doubt it will be
as trivial as just deleting the old and adding the new.

In terms of how simple each piece is:
- Converting exceptions should be trivial
- Converting utils is straightforward but we should evaluate the
individual utilities and see if they actually make sense, have tests, are
used, etc.
- Converting the requests may not be too complex but touches a huge hunk of
code and may require some effort to decouple the network layer.
- Converting the network code will be delicate and may require some changes
in org.apache.kafka.common.network to meet the server's needs.

This is all a lot of work, but if we stick to it, at the end we will have
really nice clients and a nice modular code base. :-)

Cheers,

-Jay
