[
https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892746#comment-15892746
]
peay edited comment on BEAM-1573 at 3/2/17 6:38 PM:
----------------------------------------------------
My concern is for both source and sink.
I'd like to be able to use custom
{{org.apache.kafka.common.serialization.Serializer,Deserializer}} implementations. An
example is
http://docs.confluent.io/2.0.0/schema-registry/docs/serializer-formatter.html#serializer
for setups where Kafka topics contain Avro records serialized using a
schema registry. That one uses a {{Serializer/Deserializer<Object>}}, but I
also have similar Kafka serializers with arbitrary types.
The encoding/decoding methods in
{{org.apache.kafka.common.serialization.Serializer,Deserializer}} have the following signatures:
- {{byte[] serialize(String topic, T data)}}
- {{T deserialize(String topic, byte[] data)}}
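For concreteness, here is a minimal sketch of the kind of custom deserializer I
have in mind (this is not the Confluent implementation; the registry lookup is
omitted and the class name just matches the snippet below):
{code}
import java.io.IOException;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.common.serialization.Deserializer;

// Sketch only: the point is that a Kafka deserializer gets both configure()
// and the topic name, which a Beam Coder never sees.
public class SomeCustomKafkaDeserializer implements Deserializer<Object> {
  private Map<String, ?> consumerConfig;

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    // A real implementation would build its schema-registry client from this.
    this.consumerConfig = configs;
  }

  @Override
  public Object deserialize(String topic, byte[] data) {
    // The topic is handed to us by the Kafka consumer for free.
    Schema schema = schemaFor(topic);
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
    try {
      return reader.read(null, DecoderFactory.get().binaryDecoder(data, null));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close() {}

  private Schema schemaFor(String topic) {
    throw new UnsupportedOperationException("registry lookup omitted in this sketch");
  }
}
{code}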
I would like to be able to support a syntax like this:
{code}
KafkaIO
    .read()
    .withBootstrapServers(this.broker)
    .withTopics(ImmutableList.of(this.topic))
    .withCustomKafkaValueDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx))
    .withCustomKafkaKeyDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx))

KafkaIO
    .write()
    .withBootstrapServers(this.broker)
    .withTopic(this.topic)
    .withCustomKafkaValueSerializer(new SomeCustomKafkaSerializer())
    .withCustomKafkaKeySerializer(new SomeCustomKafkaSerializer())
{code}
In both cases, Kafka would use the custom serializer/deserializer directly.
Now, why is this hard to express currently? KafkaIO is implemented
differently for read and write, so let us consider the two cases. I have a
working patch for the above syntax; it is straightforward for writes but
requires a fair number of changes for reads.
For writes, the coder is wrapped into an actual
{{org.apache.kafka.common.serialization.Serializer}} through
{{CoderBasedKafkaSerializer}}. I can write a {{CustomCoder}} around my Kafka
serializer, but I still have to pass it the topic name manually, and we end up
with a Kafka serializer wrapped in a Coder, itself wrapped in a Kafka serializer.
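To make the nesting concrete, this is roughly what the coder-based serializer
amounts to (my own sketch, not the actual {{CoderBasedKafkaSerializer}}
source): a Kafka {{Serializer}} whose {{serialize}} just calls the coder,
dropping both the topic name and the producer config on the way.
{code}
import java.util.Map;

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.kafka.common.serialization.Serializer;

// Rough sketch of the coder-based serializer used on the write path.
public class SketchCoderBasedSerializer<T> implements Serializer<T> {
  private final Coder<T> coder;

  public SketchCoderBasedSerializer(Coder<T> coder) {
    this.coder = coder;
  }

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    // The producer config stops here: a custom Kafka serializer hidden inside
    // the coder never sees it.
  }

  @Override
  public byte[] serialize(String topic, T value) {
    try {
      // The topic name is dropped, so a wrapped Kafka serializer has to carry
      // it around itself.
      return CoderUtils.encodeToByteArray(coder, value);
    } catch (CoderException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close() {}
}
{code}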
Reads are implemented differently (I am not sure why). Instead of wrapping the
coders into a Kafka deserializer, everything is hard-wired to a {{byte[]}}
Kafka consumer, and KafkaIO calls the coder after the data has been returned by
the consumer. Here too, one can write a {{CustomCoder}}, but that won't work
when a list of topics is used as input to KafkaIO, and it still requires
passing in the topic name manually even when there is only one. This is why, in
the example snippet above, I also include a second argument that is a coder, to
be used with {{setCoder}} when setting up the rest of the pipeline.
In both cases, wrapping the Kafka serializer in a Coder is also not
straightforward because Kafka serializers have a {{configure}} method that
gives them access to the consumer/producer config, so this likely needs to be
emulated in the coder wrapper.
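Putting the two issues together, the read-side workaround would look roughly
like this (names are mine, and the {{Coder}} signatures are the pre-2.0 ones
that still take a {{Context}} argument): the topic and the consumer config both
have to be duplicated into the coder, and {{configure}} has to be called by
hand.
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.CustomCoder;

// Sketch of the CustomCoder workaround for reads.
public class CustomDeserializerCoder extends CustomCoder<Object> {
  private final String topic;                        // duplicated from KafkaIO.read()
  private final Map<String, Object> consumerConfig;  // duplicated as well
  private transient SomeCustomKafkaDeserializer deserializer;

  public CustomDeserializerCoder(String topic, Map<String, Object> consumerConfig) {
    this.topic = topic;
    this.consumerConfig = consumerConfig;
  }

  private SomeCustomKafkaDeserializer deserializer() {
    if (deserializer == null) {
      deserializer = new SomeCustomKafkaDeserializer();
      // Emulate what the Kafka consumer would normally do for us.
      deserializer.configure(consumerConfig, false);
    }
    return deserializer;
  }

  @Override
  public Object decode(InputStream inStream, Context context) throws IOException {
    byte[] bytes = ByteArrayCoder.of().decode(inStream, context);
    // Only works for a single, known topic: breaks down with withTopics(...).
    return deserializer().deserialize(topic, bytes);
  }

  @Override
  public void encode(Object value, OutputStream outStream, Context context) throws IOException {
    throw new UnsupportedOperationException("read-side sketch only");
  }
}
{code}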
What do you think?
> KafkaIO does not allow using Kafka serializers and deserializers
> ----------------------------------------------------------------
>
> Key: BEAM-1573
> URL: https://issues.apache.org/jira/browse/BEAM-1573
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Affects Versions: 0.4.0, 0.5.0
> Reporter: peay
> Assignee: Raghu Angadi
> Priority: Minor
>
> KafkaIO does not allow overriding the serializer and deserializer settings
> of the Kafka consumers and producers it uses internally. Instead, it allows
> setting a `Coder`, and has a simple Kafka serializer/deserializer wrapper
> class that calls the coder.
> I appreciate that supporting Beam coders is good and consistent with the
> rest of the system. However, is there a reason to completely disallow using
> custom Kafka serializers instead?
> This is a limitation when working with an Avro schema registry, for instance,
> which requires custom serializers. One can write a `Coder` that wraps a
> custom Kafka serializer, but that means two levels of unnecessary wrapping.
> In addition, the `Coder` abstraction is not equivalent to Kafka's
> `Serializer`, which gets the topic name as input. Using a `Coder` wrapper
> would require duplicating the output topic setting in the argument to
> `KafkaIO` and when building the wrapper, which is inelegant and error-prone.