hudi-agent commented on code in PR #18797:
URL: https://github.com/apache/hudi/pull/18797#discussion_r3307258043


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieVectorSearchPlanBuilder.scala:
##########
@@ -232,25 +250,50 @@ object HoodieVectorSearchPlanBuilder {
  * and select top-K per query. The cross-join produces O(|corpus| * |queries|)
  * intermediate rows, so this is suitable for small-to-medium query sets
  * (tens to low hundreds of queries) against moderate corpora.
+ *
+ * <p>Both modes support an optional {@code filter} predicate (applied to the 
corpus before
+ * distance computation), and an optional {@code maxDistance} threshold 
(results beyond this
+ * distance are excluded before top-K selection, reducing shuffle and sort 
volume).
  */
 object BruteForceSearchAlgorithm extends VectorSearchAlgorithm {
 
   import HoodieVectorSearchPlanBuilder._
 
   override val name: String = "brute_force"
 
+  /**
+   * Applies a user-supplied SQL predicate to the corpus DataFrame, wrapping
+   * [[ParseException]] (predicate syntax error) and [[AnalysisException]]
+   * (unknown column, type mismatch, etc.) in a [[HoodieAnalysisException]] 
that
+   * echoes the offending expression. Other exception types are allowed to
+   * propagate untouched so they aren't misreported as a filter problem.
+   */
+  private def applyFilter(df: DataFrame, filterExpr: String): DataFrame = {
+    try {
+      df.filter(filterExpr)
+    } catch {
+      case e @ (_: ParseException | _: AnalysisException) =>
+        throw new HoodieAnalysisException(
+          s"Invalid pre-filter expression '$filterExpr': ${e.getMessage}")
+    }
+  }
+
   override def buildSingleQueryPlan(
       spark: SparkSession,
       corpusDf: DataFrame,
       embeddingCol: String,
       queryVector: Array[Double],
       k: Int,
-      metric: DistanceMetric.Value): LogicalPlan = {
+      metric: DistanceMetric.Value,

Review Comment:
   🤖 nit: the `var x = ...; opt.foreach(f => x = ...)` pattern (used three 
times in this file — here, in `scored` for the single-query path, and again for 
`filteredCorpus`/`scored` in the batch path) is a bit unidiomatic in Scala. 
Could you express it as a `val` with `filter.map(applyFilter(base, 
_)).getOrElse(base)` or a small fold? That keeps the intermediate DataFrame 
immutable and avoids reassignment.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieVectorSearchTableValuedFunction.scala:
##########
@@ -106,6 +115,37 @@ object HoodieVectorSearchTableValuedFunction {
     }
     kValue
   }
+
+  /** Parses a string argument that may be NULL (meaning "not specified"). */
+  private[logical] def parseOptionalString(
+      funcName: String, expr: Expression, argName: String): Option[String] = 
expr match {
+    case Literal(null, _) => None
+    case Literal(v, StringType) if v != null =>
+      val s = v.toString.trim
+      if (s.isEmpty) None else Some(s)
+    case _ => throw new HoodieAnalysisException(
+      s"Function '$funcName': argument '$argName' must be a string literal or 
NULL, got: ${expr.sql}")
+  }
+
+  /**
+   * Parses a numeric argument that may be NULL (meaning "not specified"). 
Accepts
+   * any foldable expression of [[NumericType]] (or [[NullType]] for an untyped
+   * NULL keyword) — including a bare literal or {@code CAST(literal AS 
numeric)} —
+   * and widens to Double. String literals are rejected even when their 
contents
+   * happen to parse as a number, so the type contract surfaces at parse time.
+   */
+  private[logical] def parseOptionalDouble(
+      funcName: String, expr: Expression, argName: String): Option[Double] = {
+    val numericOrNull = expr.dataType match {
+      case _: NumericType | NullType => true
+      case _ => false
+    }
+    if (!expr.foldable || !numericOrNull) {
+      throw new HoodieAnalysisException(
+        s"Function '$funcName': argument '$argName' must be a numeric literal 
or NULL, got: ${expr.sql}")
+    }
+    Option(expr.eval()).map(_.toString.toDouble)

Review Comment:
   🤖 nit: `_.toString.toDouble` is a bit roundabout for a value we've already 
verified is `NumericType`. A direct pattern match (e.g. `case n: Number => 
n.doubleValue(); case d: Decimal => d.toDouble`) would be clearer and avoid the 
string round-trip — especially for Decimal literals where the string detour is 
unnecessary work.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to