I’m trying to understand what I think is an optimizer bug. To do that, I’d
like to compare the execution plans for a certain query with and without a
certain change, to understand how that change is impacting the plan.

How would I do that in PySpark? I’m working with 2.0.1, but I can use
master if it helps.

explain()
<http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain>
is helpful but is limited in two important ways:

   1. It prints to screen and doesn’t offer another way to access or
   capture the plan.

   2. The printed plan includes auto-generated IDs that make diffing
   impossible, e.g.

    == Physical Plan ==
    *Project [struct(primary_key#722, person#550, dataset_name#671)

Any suggestions on what to do? Any relevant JIRAs I should follow?

Nick
