GitHub user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/23086#discussion_r237966188
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Scan.java ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.sources.v2.SupportsBatchRead;
+import org.apache.spark.sql.sources.v2.Table;
+
+/**
+ * A logical representation of a data source scan. This interface is used to provide logical
+ * information, like what the actual read schema is.
+ * <p>
+ * This logical representation is shared between batch scan, micro-batch streaming scan and
+ * continuous streaming scan. Data sources must implement the corresponding methods in this
+ * interface, to match what the table promises to support. For example, {@link #toBatch()} must be
+ * implemented, if the {@link Table} that creates this {@link Scan} implements
+ * {@link SupportsBatchRead}.
+ * </p>
+ */
+@Evolving
+public interface Scan {
+
+ /**
+ * Returns the actual schema of this data source scan, which may be different from the physical
+ * schema of the underlying storage, as column pruning or other optimizations may happen.
+ */
+ StructType readSchema();
+
+ /**
+ * A description string of this scan, which may include information like: what filters are
+ * configured for this scan, what's the value of some important options like path, etc. The
+ * description doesn't need to include {@link #readSchema()}, as Spark already knows it.
+ * <p>
+ * By default this returns the class name of the implementation. Please override it to provide a
+ * meaningful description.
+ * </p>
+ */
+ default String description() {
--- End diff --
What about adding `pushedFilters` that defaults to `new Filter[0]`? Data source
implementations would then override it so that any pushed filters appear in the
description. I think a Scan should be able to report its options, especially those
that distinguish it from other scans, like pushed filters.
I guess we could have some wrapper around the user-provided Scan that holds
the Scan options. I would want to standardize that instead of doing it in every
scan exec node.
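As a rough sketch of the `pushedFilters` idea, the default description could fold in whatever filters the scan reports. This is not the actual Spark API: `Filter` here is a stand-in for `org.apache.spark.sql.sources.Filter`, `GreaterThan` and `ExampleScan` are hypothetical, the output format is made up, and `Scan` is trimmed to the two relevant methods.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Stand-in for org.apache.spark.sql.sources.Filter so the sketch compiles on its own.
interface Filter {}

// Hypothetical example filter; prints like "id > 10".
final class GreaterThan implements Filter {
  private final String attribute;
  private final Object value;

  GreaterThan(String attribute, Object value) {
    this.attribute = attribute;
    this.value = value;
  }

  @Override
  public String toString() {
    return attribute + " > " + value;
  }
}

// Trimmed-down Scan: readSchema() and toBatch() are omitted for brevity.
interface Scan {

  // Proposed addition: the filters pushed down to this scan, empty by default.
  default Filter[] pushedFilters() {
    return new Filter[0];
  }

  // Default description: the class name, plus pushed filters when there are any.
  default String description() {
    String name = getClass().getSimpleName();
    Filter[] filters = pushedFilters();
    if (filters.length == 0) {
      return name;
    }
    String joined = Arrays.stream(filters)
        .map(Object::toString)
        .collect(Collectors.joining(", "));
    return name + " [pushedFilters: " + joined + "]";
  }
}

// Hypothetical source that reports a single pushed filter.
final class ExampleScan implements Scan {
  @Override
  public Filter[] pushedFilters() {
    return new Filter[] { new GreaterThan("id", 10) };
  }
}
```

With something like this, `new ExampleScan().description()` would include `id > 10`, while a scan that pushes nothing keeps the plain class-name description without overriding anything.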