Github user jose-torres commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22009#discussion_r208642760
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReadSupport.java ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader.streaming;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.execution.streaming.BaseStreamingSource;
    +import org.apache.spark.sql.sources.v2.reader.ScanConfig;
    +import org.apache.spark.sql.sources.v2.reader.ScanConfigBuilder;
    +
    +/**
    + * An interface that defines how to scan data from a data source for streaming processing in
    + * micro-batch mode.
    + */
    +@InterfaceStability.Evolving
    +public interface MicroBatchReadSupport extends StreamingReadSupport, BaseStreamingSource {
    +
    +  /**
    +   * Returns a builder of {@link ScanConfig}. The builder can take some query-specific information
    +   * like which operators to push down, streaming offsets, etc., and keep this information in the
    +   * created {@link ScanConfig}.
    +   *
    +   * This is the first step of the data scan. All other methods in {@link MicroBatchReadSupport}
    +   * need to take a {@link ScanConfig} as input.
    +   *
    +   * If this method fails (by throwing an exception), the action will fail and no Spark job
    +   * will be submitted.
    +   */
    +  ScanConfigBuilder newScanConfigBuilder(Offset start, Offset end);
    +
    +  /**
    +   * Returns the most recent offset available.
    +   */
    +  Offset latestOffset(Offset start);
    --- End diff --
    
    There's a weak form of rate control implemented by simply having sources lie about what the latest offset is. For example, you might set maxOffsetsPerTrigger = 100, and then the Kafka source will pretend that only 100 more offsets exist even if more are really available.
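    
    As a hedged sketch of that "lying" approach (this is not the actual Kafka source code; the maxOffsetsPerTrigger field and the fetchRealLatestOffset() helper are hypothetical, and LongOffset stands in for any simple long-based Offset implementation), a source could cap the offset it reports like this:
    
        @Override
        public Offset latestOffset(Offset start) {
          long begin = ((LongOffset) start).offset();
          long realLatest = fetchRealLatestOffset();  // hypothetical: query the external system
          // Report at most maxOffsetsPerTrigger new offsets, even if more are available.
          return new LongOffset(Math.min(realLatest, begin + maxOffsetsPerTrigger));
        }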
    
    Unfortunately, we're going to need to continue to support such options at least until the next major version after we have better rate limiting, so I don't think this can be removed from the source API right now.
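    
    For reference, the first-step flow the javadoc describes would look roughly like this from the engine side (a minimal sketch; the planInputPartitions call is my assumption about one of the downstream methods, not a quote from this PR):
    
        // Build a ScanConfig for the offset range of this micro-batch.
        ScanConfigBuilder builder = readSupport.newScanConfigBuilder(start, end);
        ScanConfig config = builder.build();
        // Every subsequent method takes the created ScanConfig as input, e.g.:
        InputPartition[] partitions = readSupport.planInputPartitions(config);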


---
