Max Payson created SPARK-46190:
----------------------------------

             Summary: ANSI Double quoted identifiers do not work in Python threads
                 Key: SPARK-46190
                 URL: https://issues.apache.org/jira/browse/SPARK-46190
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.0, 3.4.1, 3.4.0
            Reporter: Max Payson

Enabling and using `spark.sql.ansi.doubleQuotedIdentifiers` does not work correctly in Python threads.

The following example shows how applying a filter, "\"status\" = 'Unchanged'", leads to empty results when run in a thread. I believe this is because the "status" field is interpreted as a literal in the thread, but as an attribute outside of it.

{code:python}
from concurrent import futures

from pyspark import sql

spark = (
    sql.SparkSession.builder.master("local[*]")
    .config("spark.sql.ansi.enabled", "true")
    .config("spark.sql.ansi.doubleQuotedIdentifiers", "true")
    .getOrCreate()
)


def demonstrate_issue(spark):
    # Path to JSON file with contents:
    # [{"status": "Unchanged"}, {"status": "Changed"}]
    df = spark.read.json("data/example.json")
    df.filter("\"status\" = 'Unchanged'").show()


# Shows 1 record, expected
demonstrate_issue(spark)

with futures.ThreadPoolExecutor(1) as executor:
    # Shows 0 records, unexpected
    executor.submit(demonstrate_issue, spark)
{code}

Additional testing notes:
* When parsing the expression with `sql.functions.expr` in Java via Py4J, the "status" field is interpreted as a literal value from the thread, not an attribute
* Using double quotes with `spark.sql` does work in the thread
* Using a dataframe created in memory does work in the thread
* Tested in versions 3.4.0, 3.4.1, & 3.5.0 on Windows and Mac

The original PR that added this option is here: [https://github.com/apache/spark/pull/38022]
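
The literal-versus-attribute reading suggested in this report can be modeled without Spark. The sketch below is plain Python, not Spark's parser or the PySpark API. The helper names `filter_as_attribute` and `filter_as_literal` are hypothetical and exist only for illustration. It shows why reading "status" as a string literal would return zero rows for the sample data, while reading it as a column reference returns one.

```python
# Plain-Python model of the two possible readings of:  "status" = 'Unchanged'
# Assumption: this only imitates the suspected behavior; it is not Spark code.

# Same contents as the data/example.json file described in the report.
rows = [{"status": "Unchanged"}, {"status": "Changed"}]


def filter_as_attribute(rows):
    # Double-quoted identifiers honored: "status" names the column,
    # so each row's value is compared with 'Unchanged'.
    return [row for row in rows if row["status"] == "Unchanged"]


def filter_as_literal(rows):
    # Option not honored: "status" is treated as the string literal 'status'.
    # The predicate then compares two constants and is false for every row.
    literal = "status"
    return [row for row in rows if literal == "Unchanged"]


attribute_result = filter_as_attribute(rows)
literal_result = filter_as_literal(rows)

print(len(attribute_result))  # one row, like the call outside the thread
print(len(literal_result))    # no rows, like the call inside the thread
```

If this reading is right, a literal interpretation cannot match any row regardless of the data, which would be consistent with the empty result reported from the thread.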