freamdx commented on PR #1701:
URL: https://github.com/apache/sedona/pull/1701#issuecomment-2502898908
> > so, SessionState Field("sqlParser") is always SedonaSqlParser, it doesn't matter the inject order
>
> From what I observed, `sparkSession.sessionState.sqlParser` will be `IcebergSparkSqlExtensionsParser` right after the iceberg extension is applied. If Sedona is initialized afterwards, it will replace `sparkSession.sessionState.sqlParser` with `SedonaSqlParser`. The delegation hierarchy will be something like this:
>
> ```
> SedonaSqlParser
> └─ IcebergSparkSqlExtensionsParser
> └─ <delegated parser instance>
> ```
>
> If `SedonaSqlParser` cannot fall back to the delegated parser, special SQL syntaxes such as `CALL ...` defined by iceberg won't work. Here is an example using iceberg 1.7.0 and a locally built Sedona with this patch applied:
>
> ```
> $ pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
>     --jars "$SEDONA_SHADED_JAR" \
>     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>     --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.local.type=hadoop \
>     --conf spark.sql.catalog.local.warehouse=$HOME/local/iceberg/warehouse
>
> Welcome to
> ____ __
> / __/__ ___ _____/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 3.5.1
> /_/
>
> Using Python version 3.11.6 (main, Oct 2 2023 20:46:14)
> Spark context Web UI available at http://192.168.200.155:4040
> Spark context available as 'sc' (master = local[*], app id = local-1732620541890).
> SparkSession available as 'spark'.
> >>> from sedona.spark import *
> >>> sedona = SedonaContext.create(spark)
> 24/11/26 19:33:39 WARN RasterRegistrator$: Geotools was not found on the classpath. Raster operations will not be available.
> >>> spark.sql("CALL local.system.remove_orphan_files (table => 'test_db.test_table', dry_run => true)").show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/pyspark/sql/session.py", line 1631, in sql
>     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/pyspark/errors/exceptions/captured.py", line 185, in deco
>     raise converted from None
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'CALL'.(line 1, pos 0)
>
> == SQL ==
> CALL local.system.remove_orphan_files (table => 'test_db.test_table', dry_run => true)
> ^^^
>
> >>>
> ```
>
> Even if we add `SedonaSqlExtensions` to `spark.sql.extensions` and get rid of manual initialization, there's still a problem with SQL parsing:
>
> ```
> $ pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
>     --jars "$SEDONA_SHADED_JAR" \
>     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.apache.sedona.sql.SedonaSqlExtensions \
>     --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.local.type=hadoop \
>     --conf spark.sql.catalog.local.warehouse=$HOME/local/iceberg/warehouse
>
> Welcome to
> ____ __
> / __/__ ___ _____/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 3.5.1
> /_/
>
> Using Python version 3.11.6 (main, Oct 2 2023 20:46:14)
> Spark context Web UI available at http://192.168.200.155:4040
> Spark context available as 'sc' (master = local[*], app id = local-1732621080857).
> SparkSession available as 'spark'.
> >>> spark.sql("CALL local.system.remove_orphan_files (table => 'test_db.test_table', dry_run => true)").show()
> 24/11/26 19:38:14 WARN RasterRegistrator$: Geotools was not found on the classpath. Raster operations will not be available.
> +--------------------+
> |orphan_file_location|
> +--------------------+
> +--------------------+
>
> >>> spark.sql("CALL local.system.remove_orphan_files (table => 'test_db.test_table', dry_run => true)").show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/pyspark/sql/session.py", line 1631, in sql
>     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
>   File "/workspace/local/spark/spark-3.5.1-bin-hadoop3/python/pyspark/errors/exceptions/captured.py", line 185, in deco
>     raise converted from None
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'CALL'.(line 1, pos 0)
>
> == SQL ==
> CALL local.system.remove_orphan_files (table => 'test_db.test_table', dry_run => true)
> ^^^
>
> >>>
> ```
>
> I like the idea of overriding `astBuilder`, as it enables better integration with iceberg. However, I'm a bit worried about removing the fallback path, since it may break iceberg's SQL parsing when Sedona and iceberg coexist.
You are right, my mistake.
I'll cancel this patch for now, until the iceberg-spark-extensions unit tests pass with Sedona.
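For context, the fallback behavior discussed above can be illustrated with a minimal sketch. This is Python pseudocode, not Sedona's actual Scala implementation; all class and method names here are hypothetical. Each parser tries its own grammar first and, on failure, delegates to the next parser in the chain — which is what lets iceberg's `CALL ...` syntax survive when `SedonaSqlParser` sits at the top of the hierarchy.

```python
# Illustrative sketch of a delegating SQL parser chain (hypothetical names;
# Sedona's real parser is written in Scala against Spark's ParserInterface).

class ParseError(Exception):
    pass

class DelegatingParser:
    """Handles statements starting with its own keywords; otherwise
    falls back to the delegate parser, if one is configured."""

    def __init__(self, own_keywords, delegate=None, name=""):
        self.own_keywords = own_keywords
        self.delegate = delegate
        self.name = name

    def parse(self, sql):
        first = sql.strip().split()[0].upper()
        if first in self.own_keywords:
            return f"{self.name} handled: {first}"
        if self.delegate is not None:
            return self.delegate.parse(sql)  # the fallback path
        raise ParseError(f"[PARSE_SYNTAX_ERROR] near {first!r}")

# The delegation hierarchy from the comment:
# SedonaSqlParser -> IcebergSparkSqlExtensionsParser -> base Spark parser
base = DelegatingParser({"SELECT"}, name="SparkSqlParser")
iceberg = DelegatingParser({"CALL"}, delegate=base, name="IcebergParser")
sedona = DelegatingParser({"CREATE"}, delegate=iceberg, name="SedonaSqlParser")

print(sedona.parse("CALL local.system.remove_orphan_files(...)"))
# -> IcebergParser handled: CALL

# With the fallback removed, the same statement fails to parse:
broken = DelegatingParser({"CREATE"}, delegate=None, name="SedonaSqlParser")
try:
    broken.parse("CALL local.system.remove_orphan_files(...)")
except ParseError as e:
    print("without fallback:", e)
```

The sketch mirrors the failure in the transcripts: once the top-level parser stops delegating, any syntax it does not recognize (iceberg's `CALL`) raises a parse error, regardless of the extensions registered below it.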