[
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573677#comment-15573677
]
Michael Armbrust commented on SPARK-17812:
------------------------------------------
bq. with your proposed interface, what, as a user, do you expect to happen when
you specify startingOffsets for some but not all partitions?
I would probably opt to fail to start the query with advice on how to fix it
(i.e. "specify {{-1}} for these partitions if you don't care"). We could also
have a default, but I tend to err on the side of explicit behavior.
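
A minimal sketch of the fail-fast check described above, in plain Python rather than Spark's actual code. The function name `validate_starting_offsets` and the `(topic, partition)` tuple representation are hypothetical, chosen only for illustration; the use of `-1` as the "don't care" value follows the advice quoted in the comment.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical representation of a Kafka topic partition: (topic, partition id).
TopicPartition = Tuple[str, int]


def validate_starting_offsets(
    assigned: Iterable[TopicPartition],
    specified: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Refuse to start unless every assigned partition has a starting offset.

    Illustration only: no default is applied for missing partitions; the
    caller gets an error that says how to fix the configuration.
    """
    missing = sorted(set(assigned) - set(specified))
    if missing:
        names = ", ".join(f"{topic}-{partition}" for topic, partition in missing)
        raise ValueError(
            f"startingOffsets is missing partitions: {names}. "
            "Specify -1 for these partitions if you don't care."
        )
    return dict(specified)
```

With this shape, a partial specification never silently picks a default: the query either starts with an offset for every partition or fails with a message naming the partitions that still need one.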
bq. Yes, auto.offset.reset is a mess. Have you read
https://issues.apache.org/jira/browse/KAFKA-3370 What are you going to do when
that ticket is resolved? It should allow users to answer the questions you
raised in very specific ways, that your interface does not.
There is clearly a lot of confusing baggage with this configuration option,
specifically because it is conflating too many unrelated concerns. Furthermore,
IMHO {{auto.offset.reset}} is a pretty confusing name that does not imply
anything about where in the stream this query should start. "reset" implies you
were set somewhere to begin with.
In contrast, {{startingOffsets}} handles one case clearly: it picks the offsets
that are used as the starting point for the append-only table abstraction that
Spark is providing.
As far as I understand the discussion on the ticket you referenced, the only
case where we lack sufficient tunability is "Where do I go when the current
offsets are invalid due to retention?".
In this case, where data has been lost and {{failOnDataLoss=false}}, we
currently try to minimize the amount of data we lose by starting at the
earliest offsets available. We should certainly consider making this behavior
configurable as well, but that seems like a different concern than what is
being discussed in this JIRA.
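
The retention case above can be sketched as follows. This is a simplified illustration in plain Python, not Spark's implementation; `resolve_offset` and its parameters are hypothetical names, with `fail_on_data_loss` standing in for the {{failOnDataLoss}} option.

```python
def resolve_offset(
    requested: int,
    earliest_available: int,
    fail_on_data_loss: bool,
) -> int:
    """Pick the offset to resume from when retention may have removed data.

    Illustration only: if the requested offset still exists, use it. If it
    has aged out, either fail, or resume at the earliest offset that is
    still available so that as little data as possible is skipped.
    """
    if requested >= earliest_available:
        # Nothing was lost; continue exactly where we intended to.
        return requested
    if fail_on_data_loss:
        raise RuntimeError(
            f"Offsets {requested} to {earliest_available - 1} are no longer "
            "available; set failOnDataLoss=false to continue anyway."
        )
    # Data was lost, but the user opted to continue: minimize the loss.
    return earliest_available
```

Skipping ahead to the latest offset instead would be a different policy, and under this sketch it would be a separate setting rather than part of the query start behavior.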
Personally, it seems like if you are falling so far behind that you have to
skip all the way ahead, something is going very wrong. However, if users
request this feature, we should certainly add it. I would not, however,
shoe-horn it into anything having to do with query start behavior. It seems
like they have reached a similar conclusion, as they are considering adding a
new configuration, {{auto.reset.offset.existing}}.
bq. Is the purpose of your interface to do what you think users should be able
to do, or what they need to be able to do?
The purpose of an interface is to provide clear semantics to the user.
> More granular control of starting offsets (assign)
> --------------------------------------------------
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the
> earliest or latest offsets available at the moment the query is started.
> Sometimes this is a lot of data. It would be nice to be able to do the
> following:
> - seek to user specified offsets for manually specified topicpartitions