cloud-fan commented on code in PR #54946:
URL: https://github.com/apache/spark/pull/54946#discussion_r3462721230


##########
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:
##########
@@ -255,30 +258,44 @@ private[hive] object SparkSQLCLIDriver extends Logging {
     def continuedPromptWithDBSpaces: String = continuedPrompt + ReflectionUtils.invokeStatic(
       classOf[CliDriver], "spacesForString", classOf[String] -> currentDB)
 
+    val sqlParser = SparkSQLEnv.sparkSession.sessionState.sqlParser
     var currentPrompt = promptWithCurrentDB
     var line = reader.readLine(currentPrompt + "> ")
 
     while (line != null) {
       // SPARK-55198: call line.trim to also skip comment line with leading whitespaces,
       // this keeps the behavior align with HIVE-8396
       if (!line.trim.startsWith("--")) {
-        if (prefix.nonEmpty) {
-          prefix += '\n'
-        }
-
-        if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) {
-          line = prefix + line
-          ret = cli.processLine(line, true)
-          prefix = ""
-          currentPrompt = promptWithCurrentDB
-        } else {
-          prefix = prefix + line
-          currentPrompt = continuedPromptWithDBSpaces
+        val candidate = if (buffer.isEmpty) line else buffer + "\n" + line

Review Comment:
   The rewritten loop drops the old `!line.trim().endsWith("\\;")` 
line-continuation special-case. With the new splitter a stray `\` lexes as a 
default-channel `UNRECOGNIZED` token, while `processCmd` still has the 
`oneCmd.endsWith("\\")` continuation branch (which only accumulates — never 
executes — when it's the trailing chunk). Could you confirm the intended `\;` 
behavior and add a `CliSuite` case for it (or note it's intentionally removed)? 
It's currently uncovered. Non-blocking.
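
For reference, the check the old loop applied can be written as a standalone predicate. This is a minimal sketch taken from the removed lines in the hunk above; the names `OldTerminatorCheck` and `endsStatement` are hypothetical and not part of the PR. It shows only what the old loop treated as a statement terminator, not what the new splitter does with `\;`.

```scala
object OldTerminatorCheck {
  /**
   * Mirrors the removed condition:
   *   line.trim().endsWith(";") && !line.trim().endsWith("\\;")
   * Returns true when the old loop would have passed the buffered
   * input to processLine after reading this line.
   */
  def endsStatement(line: String): Boolean = {
    val trimmed = line.trim
    // A trailing `;` terminates the statement unless it is escaped as `\;`.
    trimmed.endsWith(";") && !trimmed.endsWith("\\;")
  }
}
```

A `CliSuite` case for the new loop could assert whichever of these outcomes is intended for a line ending in `\;`.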



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserInterface.scala:
##########
@@ -80,4 +80,24 @@ trait ParserInterface extends DataTypeParserInterface {
    */
   @throws[ParseException]("Text cannot be parsed to routine parameters")
   def parseRoutineParam(sqlText: String): StructType
+
+  /**
+   * Split a SQL string into individual statements at `;` boundaries.
+   *
+   * Designed for tooling such as the `spark-sql` CLI that needs to feed multiple
+   * statements to the parser one at a time, while correctly handling quoted
+   * strings, single-line and bracketed comments, and SQL scripting compound blocks
+   * (`BEGIN ... END`) so that semicolons inside them do not split the surrounding
+   * statement.
+   *
+   * The method is fault-tolerant: it does not throw on incomplete or malformed
+   * input. Trailing text that does not yet form a complete statement is returned
+   * in [[SqlStatementSplitResult.partialStatement]] so callers can buffer it and
+   * read more input.
+   *
+   * Implementations are expected to apply the same input preprocessing (variable
+   * substitution, etc.) as their `parsePlan` so that the splitter sees the same
+   * stream of tokens the parser would.
+   */
+  def splitStatements(sqlText: String): SqlStatementSplitResult

Review Comment:
   If split-time substitution is dropped (see the `SparkSqlParser` comment), 
this method has no per-implementation behavior left — the splitter always 
validates against the vanilla `SqlBaseParser`, so an injected/extension parser 
can't customize splitting anyway. At that point this new `@DeveloperApi` 
abstract method (plus the `AbstractSqlParser` default, the `SparkSqlParser` 
override, and the two test-decorator overrides) could be removed and the CLI 
could call `SqlStatementSplitter.split` directly. Worth considering to keep the 
public API surface minimal — non-blocking.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -1206,4 +1206,31 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
     parser.parsePlan(
       "SELECT CAST(null AS STRUCT<>), CAST(null AS MAP<STRING, ARRAY<INT>>), 2 >> 1")
   }
+
+  test("splitStatements applies variable substitution before splitting") {

Review Comment:
   This pins the up-front-substitution behavior but doesn't cover 
`SET`-then-`${...}` ordering across statements in one batch — the case flagged 
on `SparkSqlParser.splitStatements`. If that finding is addressed, please add a 
batch test asserting a later statement sees the value set by an earlier `SET`. 
Non-blocking.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/SqlStatementSplitter.scala:
##########
@@ -0,0 +1,385 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.parser
+
+import java.util.{ArrayList => JArrayList}
+
+import scala.collection.mutable
+
+import org.antlr.v4.runtime._
+import org.antlr.v4.runtime.atn.PredictionMode
+import org.antlr.v4.runtime.misc.ParseCancellationException
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+/**
+ * Represents a single complete SQL statement together with the delimiter that
+ * terminated it (always `";"` for now).
+ *
+ * @param statement   the SQL statement text, with surrounding whitespace trimmed and
+ *                    without the terminator
+ * @param terminator  the delimiter string that terminated the statement
+ */
+case class SqlStatement(statement: String, terminator: String) {
+  override def toString: String = statement + terminator
+}
+
+/**
+ * Result of splitting a SQL string into individual statements.
+ *
+ * @param completeStatements  statements that are fully terminated by `;`
+ * @param partialStatement    trailing text after the last `;` that has not yet
+ *                            formed a complete statement; an empty string when the
+ *                            input ends with `;` or contains no significant
+ *                            trailing content
+ * @param hasUnclosedComment  true when [[partialStatement]] contains an unclosed
+ *                            bracketed comment (`/* ...` with no matching `*/`).
+ *                            Interactive CLIs may want to flush the partial to the
+ *                            backend in this case (so the user sees a parse error)
+ *                            rather than keep buffering, since the input cannot be
+ *                            completed simply by appending more SQL.
+ */
+case class SqlStatementSplitResult(
+    completeStatements: Seq[SqlStatement],
+    partialStatement: String,
+    hasUnclosedComment: Boolean = false) {
+  def isEmpty: Boolean = completeStatements.isEmpty && partialStatement.isEmpty
+}
+
+/**
+ * A parser-based SQL statement splitter, inspired by Trino's
+ * `io.trino.cli.lexer.StatementSplitter`.
+ *
+ * Each candidate statement is consumed and confirmed by the ANTLR-generated
+ * [[SqlBaseParser]] via the existing `compoundOrSingleStatement` rule (the same
+ * rule that the normal Spark SQL parser uses). The splitter:
+ *
+ *   1. Tokenizes the input once.
+ *   2. Walks through the token stream. At each significant position, the
+ *      splitter asks the parser whether the prefix ending at the next `;` is a
+ *      complete statement; if not, it extends the prefix to the next `;` and
+ *      re-tries. This is how SQL scripting `BEGIN ... END` blocks (whose body
+ *      contains semicolons) end up emitted as a single statement: only the
+ *      prefix that includes a matching `END` is accepted by the parser.
+ *   3. When the parser fails because it reached EOF mid-rule (e.g. an
+ *      un-terminated `BEGIN ... END`, a SELECT with a missing operand), the
+ *      remaining input is treated as a partial statement so an interactive
+ *      caller can keep buffering.
+ *   4. When the parser fails on a non-EOF token (the input is structurally
+ *      invalid), the splitter falls back to splitting at the next `;` so the
+ *      surrounding delimiters still emit chunks and the backend can report
+ *      the error per chunk.
+ *
+ * Quoted strings, single-line and bracketed (nested) comments are honored
+ * throughout. An unterminated bracketed comment is surfaced via
+ * [[SqlStatementSplitResult.hasUnclosedComment]].
+ */
+object SqlStatementSplitter {
+
+  /** Split the given SQL text into individual statements at `;` boundaries. */
+  def split(sqlText: String): SqlStatementSplitResult = {
+    require(sqlText != null, "sqlText must not be null")
+
+    val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(sqlText)))
+    lexer.removeErrorListeners()
+    val tokenStream = new CommonTokenStream(lexer)
+    tokenStream.fill()
+
+    val numTokens = tokenStream.size()
+    // Pre-compute the positions of `;` tokens (on the default channel).
+    val delimiterPositions: Array[Int] = {
+      val acc = mutable.ArrayBuffer.empty[Int]
+      var i = 0
+      while (i < numTokens) {
+        if (tokenStream.get(i).getType == SqlBaseLexer.SEMICOLON) acc += i
+        i += 1
+      }
+      acc.toArray
+    }
+
+    val completeStatements = mutable.ArrayBuffer.empty[SqlStatement]
+    val buffer = new StringBuilder()
+    // Whether `buffer` contains any non-hidden token (i.e. any actual SQL content
+    // beyond whitespace and comments). Chunks that only contain whitespace/comments
+    // are dropped, matching the spark-sql CLI's long-standing behavior.
+    var bufferHasContent = false
+    var index = 0
+    var stopOuter = false
+    // The first index in `delimiterPositions` that is still > our cursor.
+    var delimSearchStart = 0
+
+    // Snapshot the session config once -- the splitter is short-lived and the
+    // splitter's parser must agree with the session config on grammar
+    // interpretation (e.g. `double_quoted_identifiers`).
+    val conf = SqlApiConf.get
+
+    while (!stopOuter && index < numTokens) {
+      val startIdx = nextSignificantTokenIndex(tokenStream, index)
+      if (startIdx < 0) {
+        // Just hidden trailing tokens (whitespace / closed comments). Drain
+        // them into the buffer; if any are an unclosed comment, the lexer
+        // flag will surface it as a partial.
+        while (index < numTokens) {
+          val tok = tokenStream.get(index)
+          index += 1
+          if (tok.getType != Token.EOF) buffer.append(tok.getText)
+        }
+        stopOuter = true
+      } else if (tokenStream.get(startIdx).getType == SqlBaseLexer.SEMICOLON) {
+        // The next significant token is itself a `;`. This is an empty
+        // statement (e.g. `;;` or leading `;`); drop everything from the
+        // cursor through this delimiter and continue.
+        index = startIdx + 1
+      } else {
+        // Advance the search for delimiters past our current cursor.
+        while (delimSearchStart < delimiterPositions.length &&
+            delimiterPositions(delimSearchStart) <= startIdx) {
+          delimSearchStart += 1
+        }
+
+        // Try increasingly long prefixes (each ending at a `;`) until the

Review Comment:
   Boundary detection here re-parses a *growing* prefix from the block start 
for each internal `;`, because `compoundOrSingleStatement` is `EOF`-terminated 
and can only answer "is this whole slice one statement?". So a `BEGIN ... END` 
block with k internal `;` costs ~k re-parses of an ever-longer prefix (O(k²) 
tokens parsed in total), and an incomplete block re-tries every delimiter each 
time the CLI reads another line. Ordinary non-scripting SQL is linear and the 
cost is dwarfed by query execution, so this isn't a correctness issue, but an 
uncommon large/incomplete block is a real latency cliff.
   
   It can be made O(n) without changing the overall approach: parse each region 
*once* with a single-statement entry that is **not** `EOF`-terminated, then 
read where it stopped, instead of growing prefixes. The grammar already has the 
sub-rules — `beginEndCompoundBlock` (no EOF), `statement`, `setResetStatement` 
— so a small
   
   ```
   parseOneStatement : beginEndCompoundBlock | statement | setResetStatement ;
   ```
   
   lets you parse the remaining token stream once, take 
`ctx.getStop().getTokenIndex()` as the boundary, consume the trailing `;`, and 
continue. A top-level `statement` can't consume a `;`, so the parser stops at 
the boundary deterministically; the incomplete-vs-invalid (`LA(1) == EOF`) 
classification and the extension/invalid fallback are unchanged. Non-blocking, 
but worth doing since the worst case is user-visible latency.
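
The difference in cost shape can be illustrated with a toy model. This is only a sketch under strong assumptions: it uses plain string tokens and a `BEGIN`/`END` depth counter in place of the ANTLR parser, and every name in it (`BoundaryCost`, `growingPrefix`, `singlePass`, `block`) is hypothetical. It does not model real SQL scripting grammar; it only counts how many tokens each strategy visits before finding the same boundary.

```scala
object BoundaryCost {
  /** Index of the terminating `;` (or -1) and the number of tokens visited. */
  final case class Result(boundary: Int, tokensVisited: Long)

  /** Growing-prefix strategy: at every `;`, re-scan the whole prefix from index 0. */
  def growingPrefix(tokens: IndexedSeq[String]): Result = {
    var visited = 0L
    var i = 0
    while (i < tokens.length) {
      if (tokens(i) == ";") {
        // Stand-in for "parse the prefix ending at this `;`".
        var depth = 0
        var j = 0
        while (j < i) {
          visited += 1
          if (tokens(j) == "BEGIN") depth += 1
          else if (tokens(j) == "END") depth -= 1
          j += 1
        }
        // The prefix is accepted only when BEGIN/END are balanced.
        if (depth == 0) return Result(i, visited)
      }
      i += 1
    }
    Result(-1, visited)
  }

  /** Single-pass strategy: scan once and stop at the first `;` outside any block. */
  def singlePass(tokens: IndexedSeq[String]): Result = {
    var depth = 0
    var i = 0
    while (i < tokens.length) {
      val t = tokens(i)
      if (t == "BEGIN") depth += 1
      else if (t == "END") depth -= 1
      else if (t == ";" && depth == 0) return Result(i, i + 1L)
      i += 1
    }
    Result(-1, tokens.length.toLong)
  }

  /** Toy block: BEGIN stmt ; stmt ; ... END ; with k internal statements. */
  def block(k: Int): IndexedSeq[String] =
    Vector("BEGIN") ++ Vector.fill(k)(Vector("stmt", ";")).flatten ++ Vector("END", ";")
}
```

In this model both strategies return the same boundary for `BoundaryCost.block(k)`, but the growing-prefix version visits (k+1)(k+2) tokens while the single pass visits 2k+3.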



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:
##########
@@ -103,6 +103,22 @@ class SparkSqlParser extends AbstractSqlParser {
     parseInternal(command, None)(toResult)
   }
 
+  /**
+   * Split a SQL string into individual statements after expanding any `${...}`
+   * variable references. Variable substitution has to happen *before* splitting
+   * because a substituted value may itself contain `;`, comments, or
+   * `BEGIN ... END` structure that affect statement boundaries.
+   *
+   * Parameter substitution is intentionally NOT applied here: the splitter
+   * runs at the top level of an interactive session / batch input, where there
+   * is no parameter context bound. If a caller does have a parameter context,
+   * they should pre-substitute the input and call this with the result.
+   */
+  override def splitStatements(sqlText: String): SqlStatementSplitResult = {
+    val variableSubstituted = substitutor.substitute(sqlText)

Review Comment:
   Substituting the whole input here and returning statements carved from the 
*substituted* text means the CLI substitutes the entire `-e`/`-f` batch before 
any statement executes. Because `ConfigReader` resolves config defaults 
eagerly, a `${conf}` whose value is changed by an earlier `SET` in the same 
batch resolves to the pre-`SET` / default value:
   
   ```
   SET spark.sql.shuffle.partitions=99;
   SELECT '${spark.sql.shuffle.partitions}' AS p;   -- yields 200, not 99
   ```
   
   The replaced `splitSemiColon` path substituted per-statement at parse time 
(`parseInternal`, after the `SET` ran), so this is a regression. Suggest 
splitting the raw text (i.e. drop this override and use the `AbstractSqlParser` 
default) and letting each statement substitute at parse — the old path never 
split on `;` inside a substituted value either, so no capability is lost. 
Interactive mode is unaffected since statements execute between splits.
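
The ordering problem can be reproduced without Spark in a small simulation. This is a sketch under stated assumptions: a mutable map stands in for the session config, a regex stands in for the real substitutor, and `;` splitting is deliberately naive. All names (`SubstitutionOrdering`, `eager`, `perStatement`, `naiveSplit`) are hypothetical, and the default of `200` is taken from the example above.

```scala
import scala.collection.mutable
import scala.util.matching.Regex

object SubstitutionOrdering {
  private val VarRef: Regex = """\$\{([^}]+)\}""".r
  private val SetStmt: Regex = """(?i)SET\s+([^=\s]+)\s*=\s*(.+)""".r

  /** Replace each ${key} with its current config value; unknown keys are left as-is. */
  def substitute(text: String, conf: scala.collection.Map[String, String]): String =
    VarRef.replaceAllIn(text, m => Regex.quoteReplacement(conf.getOrElse(m.group(1), m.matched)))

  /** Toy splitter: ignores quotes and comments, unlike the real one. */
  def naiveSplit(batch: String): Seq[String] =
    batch.split(";").map(_.trim).filter(_.nonEmpty).toSeq

  /** "Execute" statements in order: SET updates the config, everything else is a no-op. */
  private def run(
      statements: Seq[String],
      conf: mutable.Map[String, String],
      substituteEach: Boolean): Seq[String] =
    statements.map { raw =>
      val stmt = if (substituteEach) substitute(raw, conf) else raw
      stmt match {
        case SetStmt(key, value) => conf(key) = value
        case _ => ()
      }
      stmt
    }

  /** Substitute the whole batch up front, then split and execute. */
  def eager(batch: String, defaults: Map[String, String]): Seq[String] = {
    val conf = mutable.Map(defaults.toSeq: _*)
    run(naiveSplit(substitute(batch, conf)), conf, substituteEach = false)
  }

  /** Split the raw text, then substitute each statement just before it executes. */
  def perStatement(batch: String, defaults: Map[String, String]): Seq[String] = {
    val conf = mutable.Map(defaults.toSeq: _*)
    run(naiveSplit(batch), conf, substituteEach = true)
  }
}
```

With the two-statement batch from the example, `eager` leaves the `SELECT` with the default value while `perStatement` sees the value written by the preceding `SET`, which is the behavior a batch regression test would need to pin.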



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
