Github user marmbrus commented on the issue:
https://github.com/apache/spark/pull/15102
> My bigger concern is that it looks like you guys are continuing to hack
> in a particular direction, without addressing my points or answering whether
> you're willing to let me help work on this.
> Have you made up your mind?
Cody, I think we have been addressing your points, though I know we are not
done yet. It would be helpful if you could make specific comments on the code,
preferably with pointers to what you think the correct implementation would
look like. Otherwise it's hard to track which points you think have been
resolved and which are still in question.
I appreciate that you are concerned that some of this code is duplicated,
but I'm going to have to respectfully disagree on that point. I think this is
the right choice both for the stability of the DStream implementation and our
ability to optimize the SQL implementation.
> You should not be assuming 0 for a starting offset for partitions you've
> just learned about. You should be asking the underlying driver consumer what
> its position is.
I'll let Ryan comment further here, but I'm not sure if this is correct.
It sounds like if we rely on Kafka to manage its position, there will be
cases where a partial failure could result in data loss. In general, I think we
need to be careful about relying on Kafka internals when our end goal is to
provide a much [higher level
abstraction](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model).
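
To make the concern concrete, here is a minimal, purely illustrative Scala sketch of the offset-tracking tradeoff under discussion. None of these names come from the PR; `CheckpointedOffsets` and `resolveStart` are hypothetical. The idea is that the query's own durable log, not the consumer's in-memory position, decides where a known partition resumes, while a newly discovered partition starts from the broker's earliest *available* offset rather than a hard-coded 0 (retention or compaction may have already deleted the first records):

```scala
// Hypothetical sketch: offsets the streaming query has durably recorded
// per partition (keyed here by a "topic-partition" string for simplicity).
case class CheckpointedOffsets(byPartition: Map[String, Long]) {
  // For a partition we already track, resume from the checkpoint and
  // ignore whatever position the Kafka consumer currently holds.
  // For a newly discovered partition, fall back to the earliest offset
  // the broker still retains, not an assumed 0.
  def resolveStart(partition: String, earliestAvailable: Long): Long =
    byPartition.getOrElse(partition, earliestAvailable)
}

// Known partition: resume from the checkpointed offset 42, even though
// the broker would report an earlier (or later) position.
val checkpoint = CheckpointedOffsets(Map("topic-0" -> 42L))
assert(checkpoint.resolveStart("topic-0", earliestAvailable = 10L) == 42L)

// Newly discovered partition whose log now begins at offset 17 because
// older records expired: assuming a start of 0 would request data that
// Kafka no longer has, while asking the live consumer for its position
// after a partial failure could silently skip records instead.
assert(checkpoint.resolveStart("topic-1", earliestAvailable = 17L) == 17L)
```

This is only a model of the failure mode, not the PR's implementation; the real source must also handle offset ranges per micro-batch and partitions that disappear.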