[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573677#comment-15573677
 ] 

Michael Armbrust commented on SPARK-17812:
------------------------------------------

bq. with your proposed interface, what, as a user, do you expect to happen when 
you specify startingOffsets for some but not all partitions?

I would probably opt to fail to start the query with advice on how to fix it 
(i.e. "specify {{-1}} for these partitions if you don't care").  We could also 
have a default, but I tend to err on the side of explicit behavior.
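
A rough sketch of that fail-fast check, written in Python only for brevity. The function name, the dict shape, and the error wording are hypothetical and not taken from the Spark code base; it assumes {{-1}} is the "don't care" sentinel mentioned above.

```python
from typing import Dict, Iterable, Tuple

# (topic, partition) pair; a hypothetical stand-in for Kafka's TopicPartition
TopicPartition = Tuple[str, int]

DONT_CARE = -1  # sentinel assumed from the advice text above


def validate_starting_offsets(
    assigned: Iterable[TopicPartition],
    starting_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return starting_offsets unchanged if it covers every assigned
    partition; otherwise refuse to start and say how to fix it."""
    missing = sorted(set(assigned) - set(starting_offsets))
    if missing:
        listed = ", ".join(f"{topic}-{part}" for topic, part in missing)
        raise ValueError(
            f"startingOffsets is missing partitions: {listed}. "
            f"Specify {DONT_CARE} for these partitions if you don't care."
        )
    return starting_offsets
```

With this shape there is no implicit default: a partially specified map never starts the query, and the error lists exactly which partitions need an entry.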

bq. Yes, auto.offset.reset is a mess. Have you read 
https://issues.apache.org/jira/browse/KAFKA-3370 What are you going to do when 
that ticket is resolved? It should allow users to answer the questions you 
raised in very specific ways, that your interface does not.

There is clearly a lot of confusing baggage with this configuration option, 
specifically because it is conflating too many unrelated concerns. Furthermore, 
IMHO {{auto.offset.reset}} is a pretty confusing name that does not imply 
anything about where in the stream this query should start. "reset" implies you 
were set somewhere to begin with.

In contrast, {{startingOffsets}} handles one case clearly: it picks the offsets 
that are used as a starting point for the append only table abstraction that 
Spark is providing.

As far as I understand the discussion on the ticket you referenced, the only 
case where we lack sufficient tunability is "Where do I go when the current 
offsets are invalid due to retention?".

In this case, where data has been lost and {{failOnDataLoss=false}}, we 
currently try to minimize the amount of data we lose by starting at the 
earliest offsets available.  We should certainly consider making this behavior 
configurable as well, but that seems like a different concern than what is 
being discussed in this JIRA.
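
To make the current behavior concrete, here is a minimal sketch in Python. It is an illustration of the rule described above, not the actual Spark implementation; {{resolve_offset}} and its parameters are hypothetical names, and "earliest" means the earliest offset still retained for the partition.

```python
def resolve_offset(requested: int, earliest: int, fail_on_data_loss: bool) -> int:
    """Pick the offset to read from when the requested one may have aged out.

    requested: the offset the query wants to resume from
    earliest:  the earliest offset still available for the partition
    """
    if requested >= earliest:
        # Nothing was lost; continue from where we wanted to be.
        return requested
    if fail_on_data_loss:
        # Offsets in [requested, earliest) are gone; surface it to the user.
        raise RuntimeError(
            f"Offsets {requested} to {earliest - 1} are no longer available"
        )
    # failOnDataLoss=false: lose as little as possible by resuming at the
    # earliest offset that still exists.
    return earliest
```

A configurable policy (for example, skipping to the latest offset instead) would change only the last branch, which is why it is separable from how the query picks its starting point.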

Personally, it seems like if you are falling so far behind that you have to 
skip all the way ahead, something is going very wrong.  However, if users 
request this feature, we should certainly add it. I would not, however, 
shoehorn it into anything having to do with query start behavior. It seems 
like the participants on that ticket have reached a similar conclusion, as 
they are considering adding a new configuration, {{auto.reset.offset.existing}}.

bq. Is the purpose of your interface to do what you think users should be able 
to do, or what they need to be able to do?

The purpose of an interface is to provide clear semantics to the user.

> More granular control of starting offsets (assign)
> --------------------------------------------------
>
>                 Key: SPARK-17812
>                 URL: https://issues.apache.org/jira/browse/SPARK-17812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user-specified offsets for manually specified topicpartitions


