[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573677#comment-15573677 ]
Michael Armbrust commented on SPARK-17812:
------------------------------------------

bq. with your proposed interface, what, as a user, do you expect to happen when you specify startingOffsets for some but not all partitions?

I would probably opt to fail to start the query with advice on how to fix it (i.e. "specify {{-1}} for these partitions if you don't care"). We could also have a default, but I tend to err on the side of explicit behavior.

bq. Yes, auto.offset.reset is a mess. Have you read https://issues.apache.org/jira/browse/KAFKA-3370 What are you going to do when that ticket is resolved? It should allow users to answer the questions you raised in very specific ways, that your interface does not.

There is clearly a lot of confusing baggage with this configuration option, specifically because it conflates too many unrelated concerns. Furthermore, IMHO {{auto.offset.reset}} is a pretty confusing name that does not imply anything about where in the stream this query should start. "reset" implies you were set somewhere to begin with.

In contrast, {{startingOffsets}} handles one case clearly: it picks the offsets that are used as a starting point for the append-only table abstraction that Spark is providing.

As far as I understand the discussion on the ticket you referenced, the only case where we lack sufficient tunability is "Where do I go when the current offsets are invalid due to retention?". In this case, where data has been lost and {{failOnDataLoss=false}}, we currently try to minimize the amount of data we lose by starting at the earliest offsets available.

We should certainly consider making this behavior configurable as well, but that seems like a different concern than what is being discussed in this JIRA. Personally, it seems like if you are falling so far behind that you have to skip all the way ahead, something is going very wrong. However, if users request this feature, we should certainly add it.
I would not, however, shoe-horn it into anything having to do with query start behavior. It seems like they have reached a similar conclusion, as they are considering adding a new configuration, {{auto.reset.offset.existing}}.

bq. Is the purpose of your interface to do what you think users should be able to do, or what they need to be able to do?

The purpose of an interface is to provide clear semantics to the user.

> More granular control of starting offsets (assign)
> --------------------------------------------------
>
>                 Key: SPARK-17812
>                 URL: https://issues.apache.org/jira/browse/SPARK-17812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the
> earliest or latest offsets available at the moment the query is started.
> Sometimes this is a lot of data.  It would be nice to be able to do the
> following:
>  - seek to user specified offsets for manually specified topicpartitions

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
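
For illustration, the fail-fast behavior proposed in the comment (refuse to start when {{startingOffsets}} covers only some partitions, and say how to fix it) could look roughly like the sketch below. It is plain Python, not Spark code: the function name, the JSON shape ({{topic -> partition -> offset}}), and the {{assigned}} argument are all assumptions made for the example. Only the {{-1}} "don't care" value and the idea of failing with advice come from the discussion.

```python
import json

# Sentinel taken from the comment: "specify -1 for these partitions if you don't care".
DONT_CARE = -1


def validate_starting_offsets(starting_offsets_json, assigned):
    """Hypothetical helper: check that every assigned partition has a starting offset.

    starting_offsets_json: JSON string shaped like {"topic": {"0": 42, "1": -1}}
                           (an assumed format for this sketch).
    assigned: dict mapping topic name -> list of partition ids the query will read.
    Returns a dict {(topic, partition): offset}; raises ValueError if any assigned
    partition is missing, with advice on how to fix the configuration.
    """
    specified = json.loads(starting_offsets_json)

    # Collect every assigned partition that the user did not mention.
    missing = []
    for topic, partitions in assigned.items():
        given = specified.get(topic, {})
        for partition in partitions:
            if str(partition) not in given:
                missing.append((topic, partition))

    if missing:
        names = ", ".join(f"{t}-{p}" for t, p in sorted(missing))
        raise ValueError(
            f"startingOffsets does not cover partitions: {names}. "
            f"Specify {DONT_CARE} for these partitions if you don't care."
        )

    # Normalize JSON string keys into (topic, int partition) tuples.
    return {
        (topic, int(partition)): offset
        for topic, parts in specified.items()
        for partition, offset in parts.items()
    }
```

Calling it with offsets for partition 0 of a two-partition topic raises a {{ValueError}} naming the uncovered partition, rather than silently picking a default, which is the explicit behavior the comment argues for.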