[jira] [Created] (SPARK-46190) ANSI Double quoted identifiers do not work in Python threads

Max Payson (Jira) Thu, 30 Nov 2023 14:31:05 -0800

Max Payson created SPARK-46190:
----------------------------------

             Summary: ANSI Double quoted identifiers do not work in Python threads
                 Key: SPARK-46190
                 URL: https://issues.apache.org/jira/browse/SPARK-46190
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.0, 3.4.1, 3.4.0
            Reporter: Max Payson



Enabling and using `spark.sql.ansi.doubleQuotedIdentifiers` does not work 
correctly in Python threads.

The following example applies the filter "\"status\" = 'Unchanged'" to a 
DataFrame. It returns the matching record on the main thread but an empty 
result when run in a thread. I believe this is because "status" is 
interpreted as a string literal in the thread, but as an attribute (a column 
reference) outside of it.
{code:python}
from concurrent import futures
from pyspark import sql

spark = (
  sql.SparkSession.builder.master("local[*]")
  .config("spark.sql.ansi.enabled", "true")
  .config("spark.sql.ansi.doubleQuotedIdentifiers", "true")
  .getOrCreate()
)

def demonstrate_issue(spark):
  # Path to JSON file with contents:
  # [{"status": "Unchanged"}, {"status": "Changed"}]
  df = spark.read.json("data/example.json")
  df.filter("\"status\" = 'Unchanged'").show()

# Shows 1 record, expected
demonstrate_issue(spark)

with futures.ThreadPoolExecutor(1) as executor:
  # Shows 0 records, unexpected
  # result() blocks until the task finishes and re-raises any exception
  executor.submit(demonstrate_issue, spark).result()
{code}
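To make the difference easier to assert on than the visual output of `show()`, the same task could be run on both threads and the two results compared. The following is a minimal sketch, not part of the original report; the helper name `run_on_both_threads` is hypothetical.

```python
from concurrent import futures
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_on_both_threads(task: Callable[[], T]) -> Tuple[T, T]:
    """Run `task` on the calling thread, then on a single worker thread.

    Hypothetical helper for illustration; returns (main_result, worker_result).
    """
    main_result = task()
    with futures.ThreadPoolExecutor(max_workers=1) as executor:
        # result() blocks until the task finishes and re-raises any
        # exception raised inside the worker thread.
        worker_result = executor.submit(task).result()
    return main_result, worker_result
```

With the session from the example, one might pass `lambda: spark.read.json("data/example.json").filter("\"status\" = 'Unchanged'").count()` as the task. Based on the behavior described here, the two counts would be expected to differ on the affected versions.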
 

Additional testing notes:
 * When the expression is parsed with `sql.functions.expr` in Java via Py4J 
from the thread, "status" is interpreted as a literal value, not an 
attribute
 * Using double quotes with `spark.sql` does work in the thread
 * Using a DataFrame created in memory does work in the thread
 * Tested in versions 3.4.0, 3.4.1, and 3.5.0 on Windows and Mac
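A possible workaround, not verified against the affected versions, is to quote the column with backticks, which Spark SQL treats as identifier delimiters regardless of `spark.sql.ansi.doubleQuotedIdentifiers`. The sketch below assumes that backtick quoting is unaffected by this issue; `backtick_quote`, `FILTER_EXPR`, and `demonstrate_workaround` are hypothetical names introduced for illustration.

```python
def backtick_quote(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


# Same predicate as in the example, with the column quoted by backticks
# instead of double quotes.
FILTER_EXPR = f"{backtick_quote('status')} = 'Unchanged'"


def demonstrate_workaround(spark):
    # Assumption: same JSON file as in the example above.
    df = spark.read.json("data/example.json")
    df.filter(FILTER_EXPR).show()
```

Calling `executor.submit(demonstrate_workaround, spark).result()` from the thread pool would show whether the backtick form avoids the problem. The Column API, for example `df.filter(sql.functions.col("status") == "Unchanged")`, is another option that bypasses the SQL expression parser entirely.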

 

The original PR that added this option is here: 
[https://github.com/apache/spark/pull/38022]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
