Hi, 

Flux is simply a mechanism for enabling Java object creation using a descriptor 
file. If Flux does not support creating classes that follow the Builder design 
pattern, that is a Flux limitation and has to be fixed. It is not reasonable to 
impose that no one can write a builder because Flux does not support it. My 
suggested approach was a simple solution to quickly get around it. Let’s 
identify the proper way to fix it. 

I do not think it is reasonable to disregard immutability and encapsulation 
unless there is a very strong reason to do so. Doing so makes the code fragile, 
hard to debug, and thread-unsafe.
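
To make the point concrete, here is a minimal sketch (a hypothetical 
ExampleConfig, not the actual KafkaSpoutConfig code) of an immutable 
configuration object built through a builder:

    // All fields are final: once built, the config cannot change under the
    // spout, so it is safe to share across threads.
    public final class ExampleConfig {
        private final String topic;              // required, no default
        private final long offsetCommitPeriodMs; // optional, has a default

        private ExampleConfig(Builder builder) {
            this.topic = builder.topic;
            this.offsetCommitPeriodMs = builder.offsetCommitPeriodMs;
        }

        public static final class Builder {
            private final String topic;
            private long offsetCommitPeriodMs = 30_000;

            public Builder(String topic) { this.topic = topic; }

            public Builder setOffsetCommitPeriodMs(long ms) {
                this.offsetCommitPeriodMs = ms;
                return this;
            }

            public ExampleConfig build() { return new ExampleConfig(this); }
        }
    }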

> On Jun 14, 2017, at 12:22 PM, Priyank Shah <ps...@hortonworks.com> wrote:
> 
> Hi Stig/Hugo,
> 
> That constructor is indeed public. I actually made that change but forgot 
> about it. 
> https://github.com/apache/storm/commit/5ff7865cf0b86f40e99b54e789fa60b8843191aa
>  The reason for making that change was to make it work with Flux.
> 
> I think changing the flux code to access a private constructor is a hack and I 
> prefer not doing that. On the other hand, having a public constructor defeats 
> the purpose of the Builder pattern, since builder.build() is supposed to create 
> the object. I personally don’t think immutability of KafkaSpoutConfig is that 
> important here. I would rather get rid of the builder and have one or 
> two constructors with some setXXX methods. Let me know what you guys think.
> 
> 
> On 6/14/17, 9:46 AM, "Hugo Da Cruz Louro" <hlo...@hortonworks.com> wrote:
> 
>    @Harsha @Stig, I agree with you. Let’s make manual partition assignment 
> the de facto implementation. I have already adjusted the KafkaTridentSpout 
> code to reflect @Stig’s changes, and things seem to be working very well for 
> Trident as well. I am tracking that on 
> https://issues.apache.org/jira/browse/STORM-2554 and I will submit a PR 
> soon. There were a couple of minor fixes that I had to provide to 
> https://github.com/apache/storm/pull/2150 to make it work; I will mention 
> them as comments in the PR.
> 
>    @Priyank, the KafkaSpoutConfig class should be immutable: it is a 
> configuration class, and it should not be possible to change it once it is 
> passed to the KafkaSpout or KafkaTridentSpout. The constructor that @Stig 
> referenced should indeed be private, or at most package-private if needed for 
> unit tests, not public. If we have to leave it public for now to make Flux 
> work, so be it. However, the right fix would be to change the Flux code to 
> work with builders. Flux mostly uses Java reflection, so the fix may be as 
> simple as allowing invocation of private constructors, as described here: 
> https://stackoverflow.com/questions/11282265/how-to-call-a-private-method-from-outside-a-java-class
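> 
>    For illustration, a hedged sketch of the reflection approach from that link 
>    (assuming a KafkaSpoutConfig(Builder) constructor and a builder variable 
>    already configured; this is not the actual Flux code):
> 
>    import java.lang.reflect.Constructor;
> 
>    // Look up the private constructor and lift the access restriction, as the
>    // linked answer describes; Flux could do the same internally via reflection.
>    Constructor<KafkaSpoutConfig> ctor =
>        KafkaSpoutConfig.class.getDeclaredConstructor(KafkaSpoutConfig.Builder.class);
>    ctor.setAccessible(true); // bypasses the private modifier
>    KafkaSpoutConfig config = ctor.newInstance(builder);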
> 
>    We should try to eliminate as many constructors as possible. There should be 
> one or two constructors that enforce the dependencies that are absolutely 
> required and for which there are no reasonable defaults. Any other optional, 
> or non-default, configuration setting should go in a setter method. All the 
> KafkaConsumer properties, as we all seem to agree, should be passed in a 
> Map<String, Object>, which is what KafkaConsumer needs in its constructor.
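> 
>    As a sketch of that shape (the builder arguments here are illustrative; the 
>    real required arguments may differ):
> 
>    import java.util.HashMap;
>    import java.util.Map;
> 
>    // Consumer properties go in one map, keyed by the standard
>    // ConsumerConfig names, and are forwarded to KafkaConsumer as-is.
>    Map<String, Object> consumerProps = new HashMap<>();
>    consumerProps.put("bootstrap.servers", "localhost:9092");
>    consumerProps.put("group.id", "my-group");
> 
>    KafkaSpoutConfig<String, String> config =
>        KafkaSpoutConfig.builder("my-topic", consumerProps) // hypothetical required args
>            .setOffsetCommitPeriodMs(10_000)                // optional, non-default setting
>            .build();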
> 
>    Hugo
> 
> 
> On Jun 14, 2017, at 8:38 AM, Stig Døssing 
> <generalbas....@gmail.com> wrote:
> 
>    It looks public to me?
>    
> https://github.com/apache/storm/blob/38e997ed96ce6627cabb4054224d7044fd2b40f9/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L461
> 
>    I think it's good to be able to avoid telescoping constructors, while at the
>    same time not having a bunch of setters on the KafkaSpoutConfig. That's the
>    main purpose of the builder, I think: allowing KafkaSpoutConfig to be
>    immutable.
> 
>    I'd be happy to fiddle with it if you have an example to work from?
> 
>    2017-06-14 1:11 GMT+02:00 Priyank Shah <ps...@hortonworks.com>:
> 
>    Hi Stig,
> 
>    I think the KafkaSpoutConfig constructor is private, and it's throwing errors
>    when using the approach that you mentioned. Making it public defeats the
>    purpose of the builder. Can you give it a shot and confirm at your end whether
>    it's possible?
> 
>    Thanks
>    Priyank
> 
>    Sent from my iPhone
> 
>    On Jun 13, 2017, at 9:36 AM, Stig Døssing <generalbas....@gmail.com>
>    wrote:
> 
>    Hi Priyank,
> 
>    I changed my mind since those mails were sent. I don't think setKey/Value
>    are very useful. They couldn't be used with the default Kafka
>    deserializers, only with deserializers implemented by the user, and then
>    only if those were declared to implement SerializableDeserializer. I agree
>    that we should remove them, and I'm not going to undo anything currently in
>    the PR (unless there are objections on the PR, of course).
> 
>    With regard to getting rid of the builder pattern, I think it is a pretty
>    nice pattern for Java. It looks to me like it should be possible to declare
>    and configure the builder with "component:", and then pass it to the
>    KafkaSpoutConfig constructor with "ref:" afterwards (which lets you avoid
>    calling build()). Doesn't this work?
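> 
>    In Java terms, that wiring amounts to the following (using the public
>    KafkaSpoutConfig(Builder) constructor discussed above; the broker and
>    topic names are placeholders):
> 
>    // Configure the builder as a separate component, then hand it to the
>    // KafkaSpoutConfig constructor instead of calling build().
>    KafkaSpoutConfig.Builder<String, String> builder =
>        KafkaSpoutConfig.builder("localhost:9092", "my-topic");
>    KafkaSpoutConfig<String, String> config = new KafkaSpoutConfig<>(builder);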
> 
>    2017-06-12 23:32 GMT+02:00 Priyank Shah <ps...@hortonworks.com>:
> 
>    Hi Stig,
> 
>    I think the PR https://github.com/apache/storm/pull/2155/files you created
>    gets rid of setKey and setValue. I am fine with it; in fact that's what
>    I was suggesting in the first place. However, your last two email replies
>    suggest something else. Just making sure you are not going to undo anything
>    in the PR and that we are on the same page about this change, i.e. no setKey
>    or setValue, either for SerializableDeserializer implementations or the
>    Kafka Deserializer interface. Only a string with the FQCN as a property.
> 
>    The other thing I propose is to get rid of the builder class. The reason is
>    that constructing an object with the builder pattern requires builder.build(),
>    and that does not work well with flux yaml. I think we should be careful about
>    implementing new connectors and make sure they work with yaml as well. I
>    have commented on the PR as well. Unless someone else has a different
>    opinion, can you address that as well?
> 
>    On 6/10/17, 2:05 AM, "Stig Døssing" <generalbas....@gmail.com> wrote:
> 
>      Priyank, I was a bit too hasty in the last response. The setKey/Value
>      functions are necessary when users want to set only the key or the value
>      deserializer. I think we should keep those. It may be possible to
>      deduplicate the functionality in the API by removing the Builder
>      constructors that take deserializers, and by getting rid of the
>      setKey/Value functions that take Class instances, since those seem like a
>      duplication of the consumer config functionality. This should get rid of a
>      lot of the overloads.
> 
>      2017-06-10 10:20 GMT+02:00 Stig Døssing <generalbas....@gmail.com>:
> 
>    Harsha,
> 
>    +1 for simplifying away those methods that are just setting consumer
>    config. The properties I think we should keep are all the spout
>    configuration (timeouts, retry handling, tuple construction). Maybe we
>    deprecate the consumer config functions on 1.x and remove them on master?
> 
>    Priyank,
> 
>    When the spout is declared, it takes type parameters to define the key and
>    value types of the consumer. We are able to check at compile time that the
>    deserializers match those expected types, e.g.
> 
>    SerializableDeserializer<Integer> des = ...
> 
>    KafkaSpoutConfig<Integer, String> config =
>        KafkaSpoutConfig.builder("dummy", "dummy")
>            .setKey(des)
>            .build();
> 
>    KafkaSpout<String, String> wrongTypeSpout = new KafkaSpout<>(config);
> 
>    will not compile, while
> 
>    SerializableDeserializer<String> des = ...
> 
>    KafkaSpoutConfig<String, String> config =
>        KafkaSpoutConfig.builder("dummy", "dummy")
>            .setKey(des)
>            .build();
> 
>    KafkaSpout<String, String> spout = new KafkaSpout<>(config);
> 
>    will. If we want to simplify the API, maybe we should just mirror the
>    KafkaConsumer API more closely and remove the Builder setKey/Value methods.
>    I can't think of a reason why a user should need to create a Builder of one
>    type, and then change the type later with setKey/Value. The deserializers
>    can just go in through the Builder constructor.
> 
>    About KafkaTuple, I think it was done this way originally since requiring
>    users to subclass KafkaTuple would be a breaking change. If we want to do
>    it, it should go in 2.x only. I'm not necessarily opposed to doing it, but I
>    don't really have a strong opinion on it.
> 
>    Hugo,
> 
>    I appreciate that the subscribe API is a major new feature of the 0.9
>    consumer, but I can't come up with any reason to use it in Storm. I don't
>    think we should support it just because it is there. As mentioned upthread,
>    the features offered by that API are already covered by Storm, so I'm not
>    seeing the value in having it. If we can't come up with a use case for it, I
>    don't see a reason to allow users to configure it, especially given the
>    non-obvious problems users who choose it are likely to run into.
> 
> 
>    2017-06-10 6:03 GMT+02:00 Harsha <st...@harsha.io>:
> 
>    Dynamic assignment is what is causing all the issues that we see now:
>    1. Duplicates at the start of the KafkaSpout and upon any rebalance
>    2. The Trident Kafka Spout not holding its transactional batches
>    Many corner cases can easily produce duplicates.
> 
>    There is no point in keeping dynamic assignment given all the issues
>    that are showing up.
>    Here is the excerpt from the Kafka consumer docs
>    https://www-us.apache.org/dist/kafka/0.10.0.1/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
>    "If the process itself is highly available and will be restarted if it
>    fails (perhaps using a cluster management framework like YARN, Mesos, or
>    AWS facilities, or as part of a stream processing framework). In this
>    case there is no need for Kafka to detect the failure and reassign the
>    partition since the consuming process will be restarted on another
>    machine."
> 
>    Manual assignment is the right way to go.
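> 
>    For reference, manual assignment with the plain consumer API looks like this
>    (a sketch; a spout task would compute its partition set from its task index,
>    and consumerProps is the usual consumer config map):
> 
>    import java.util.Arrays;
>    import org.apache.kafka.clients.consumer.KafkaConsumer;
>    import org.apache.kafka.common.TopicPartition;
> 
>    // assign() pins this consumer to fixed partitions; there is no group
>    // rebalancing, so no duplicates from partition reassignment.
>    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
>    consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0),
>                                  new TopicPartition("my-topic", 1)));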
> 
>    -Harsha
> 
>    On Jun 9, 2017, 4:09 PM -0700, Hugo Da Cruz Louro
>    <hlo...@hortonworks.com>, wrote:
>    +1 for simplifying KafkaSpoutConfig. Too many constructors and too many
>    methods. I am not sure it’s justifiable to have any methods that simply
>    set KafkaConsumer properties. All of these properties should just go in
>    a Map<String, Object>, which is what KafkaConsumer receives, and what
>    was supported in the initial implementation. The names of the properties
>    can be retrieved from org.apache.kafka.clients.consumer.ConsumerConfig.
>    At this point we may have to keep in mind backwards compatibility.
> 
>    Not sure we should completely discontinue dynamic partition assignment,
>    as it is one of the primary features of the new Storm Kafka Client API.
>    With this said, manual partition assignment should be supported and would
>    solve a lot of potential problems arising from dynamic partition
>    assignment.
> 
>    Hugo
> 
>    On Jun 9, 2017, at 3:33 PM, Harsha <st...@harsha.io> wrote:
> 
>    I think the question is why we need all those settings when a user can
>    pass them via Properties with the consumer properties defined, or via a
>    Map conf object. Having methods on top of the consumer config means that
>    every time a Kafka consumer property is added or changed, one needs to
>    add a builder method. We need to get out of the way and let the user
>    configure it like they do for a typical Kafka consumer; instead we have
>    tens of methods that set properties for ConsumerConfig.
> 
>    Examples:
>    https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L317
> 
>    https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L309
>    etc. All of these are specific to the KafkaConsumer config; users should
>    be able to pass all of these via Properties.
> 
>    https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L327
> 
>    What's the benefit of adding that method? And we are forcing it to set
>    the protocol to "SSL" in this method:
>    https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L318
> 
>    Users can set the SSL properties and then select the securityProtocol
>    "SASL_SSL", which requires both the Kerberos and SSL configs to be set.
>    In the above case, calling setSSLTruststore changes security.protocol to
>    "SSL". This could easily run into issues if the user first sets
>    securityProtocol to "SASL_SSL" and then later calls setSSLTruststore,
>    which changes it to "SSL".
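> 
>    To make the ordering hazard concrete (method names as used above; treat the
>    exact signatures as approximate, this is only a sketch of the described
>    behavior):
> 
>    builder.setSecurityProtocol("SASL_SSL");        // the user's intent
>    builder.setSSLTruststore("/path/ts.jks", "pw"); // silently resets security.protocol to "SSL"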
> 
>    We are taking over these settings instead of letting the user figure them
>    out from the Kafka consumer config page.
> 
>    In contrast, on the producer side we have KafkaBolt, which does this:
>    https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/bolt/KafkaBolt.java#L121
>    I would add a Properties object instead of deriving it from topologyConf,
>    but this is much easier for the users to understand. The contract here
>    is: put whatever producer configs the user wants in the conf object, and
>    we create the producer out of that config.
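> 
>    As a sketch of that contract (the property keys are the standard Kafka
>    producer ones; the serializer classes shown are Kafka's own):
> 
>    import java.util.Properties;
>    import org.apache.kafka.clients.producer.KafkaProducer;
> 
>    // The user owns the config: whatever producer properties they put in
>    // the Properties object are forwarded to the producer unmodified.
>    Properties producerProps = new Properties();
>    producerProps.put("bootstrap.servers", "localhost:9092");
>    producerProps.put("key.serializer",
>        "org.apache.kafka.common.serialization.StringSerializer");
>    producerProps.put("value.serializer",
>        "org.apache.kafka.common.serialization.StringSerializer");
>    producerProps.put("security.protocol", "SASL_SSL"); // user picks it; we never override it
> 
>    KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);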
> 
>    Honestly, these interfaces need to be simple and let the user have
>    control instead of adding our interpretation.
> 
> 
> 
>    Thanks,
>    Harsha
>    On Jun 9, 2017, 2:08 PM -0700, Stig Døssing <generalbas....@gmail.com>,
>    wrote:
>    I'd be happy with a simpler KafkaSpoutConfig, but I think most of the
>    configuration parameters have good reasons for being there. Any examples
>    of parameters you think we should remove?
> 
>    2017-06-09 21:34 GMT+02:00 Harsha <st...@harsha.io>:
> 
>    +1 on using manual assignment for the reasons specified below. We
>    will see duplicates even in stable conditions, which
>    is not good. I don’t see any reason not to switch to manual assignment.
>    While we are at it, we should refactor the KafkaConfig part.
>    It should be as simple as accepting the Kafka consumer config or
>    properties file and forwarding it to KafkaConsumer. We made
>    it overly complex, and unnecessarily so.
> 
>    Thanks,
>    Harsha
> 
