HeartSaVioR commented on a change in pull request #34333: URL: https://github.com/apache/spark/pull/34333#discussion_r739938401
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RatePerMicroBatchProvider.scala
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import java.util
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
+import org.apache.spark.sql.connector.read.streaming.{ContinuousStream, MicroBatchStream}
+import org.apache.spark.sql.internal.connector.SimpleTableProvider
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+/**
+ * A source that generates increment long values with timestamps. Each generated row has two
+ * columns: a timestamp column for the generated time and an auto increment long column starting
+ * with 0L.
+ *
+ * This source supports the following options:
+ * - `rowsPerMicroBatch` (e.g. 100): How many rows should be generated per micro-batch.
+ * - `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the
+ * generated rows.
+ * - `startTimestamp` (e.g. 1000, default: 0): starting value of generated time
+ * - `advanceMillisPerMicroBatch` (e.g. 1000, default: 1000): the amount of time being advanced in
+ * generated time on each micro-batch.
+ *
+ * Unlike `rate` data source, this data source provides a consistent set of input rows per
+ * micro-batch regardless of query execution (configuration of trigger, query being lagging, etc.),
+ * say, batch 0 will produce 0~999 and batch 1 will produce 1000~1999, and so on. Same applies to
+ * the generated time.
+ *
+ * As the name represents, this data source only supports micro-batch read.
+ */
+class RatePerMicroBatchProvider extends SimpleTableProvider with DataSourceRegister {
+  import RatePerMicroBatchProvider._
+
+  override def getTable(options: CaseInsensitiveStringMap): Table = {
+    val rowsPerBatch = options.getLong(ROWS_PER_BATCH, 0)

Review comment:
No, the value in the classdoc is just an example. The option is a "required" one - I just put a default value of 0 here and make it fail, because `getLong` requires a default value.

There is no good default value for the option; I see the rate source takes 1 as its default rows per second, but I doubt that reflects actual usage. The same applies to this option - end users will have a set of inputs for micro-batches in mind when using this data source, and the value heavily depends on the test workload. Instead of having a default value which could be far from realistic, it seems clearer to require the option and give an error message that guides end users to set it when the option is not specified.
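
To illustrate the approach described in the comment above, here is a minimal sketch of treating a placeholder default as "not specified" and failing with a guiding error message. It is not the code from the pull request: `RequiredOptionSketch`, the plain `Map` (standing in for `CaseInsensitiveStringMap` so the sketch runs without Spark on the classpath), and the error text are all assumptions made for the example.

```scala
// Hypothetical sketch, not the implementation from the pull request.
object RequiredOptionSketch {
  // Option key as documented in the classdoc of the diff above.
  val RowsPerBatch = "rowsPerMicroBatch"

  // A plain Map stands in for Spark's CaseInsensitiveStringMap (assumption).
  // Note that, unlike the real class, this Map is case-sensitive.
  def rowsPerBatch(options: Map[String, String]): Long = {
    // 0 is only a placeholder default, mirroring `getLong(key, 0)`:
    // it marks the option as missing rather than being a usable value.
    val value = options.get(RowsPerBatch).map(_.trim.toLong).getOrElse(0L)
    if (value <= 0) {
      throw new IllegalArgumentException(
        s"Invalid value '$value'. The option '$RowsPerBatch' must be specified and positive.")
    }
    value
  }
}
```

With this shape, `RequiredOptionSketch.rowsPerBatch(Map("rowsPerMicroBatch" -> "1000"))` returns the configured value, while an empty option map throws an `IllegalArgumentException` that tells the user which option to set.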
