I’m trying to understand what I think is an optimizer bug. To do that, I’d like to compare the execution plans for a query with and without a particular change, to see how that change affects the plan.
How would I do that in PySpark? I’m working with 2.0.1, but I can use master if it helps.

explain() <http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain> is helpful but is limited in two important ways:

1. It prints to screen and doesn’t offer another way to access or capture the plan.
2. The printed plan includes auto-generated IDs that make diffing impossible, e.g.:

== Physical Plan ==
*Project [struct(primary_key#722, person#550, dataset_name#671)

Any suggestions on what to do? Any relevant JIRAs I should follow?

Nick
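In case it helps others hitting the same wall, one workaround sketch (relying on a private accessor, `df._jdf`, so not guaranteed stable across versions): pull the plan string out of the JVM QueryExecution object that explain() itself prints from, then strip the `#NNN` expression IDs with a regex before diffing. The capture step needs a live SparkSession, so it’s shown in comments; the normalization part is plain Python:

```python
import re

def normalize_plan(plan):
    """Strip auto-generated expression IDs (e.g. person#550 -> person)
    so two plan strings can be diffed meaningfully."""
    return re.sub(r"#\d+L?", "", plan)

# To capture the plan as a string instead of printing it, reach into the
# JVM QueryExecution behind the DataFrame (private API; may change):
#   plan = df._jdf.queryExecution().toString()      # full plan output
#   plan = df._jdf.queryExecution().simpleString()  # physical plan only
# Then write normalize_plan(plan) to a file and diff the two versions.

sample = "*Project [struct(primary_key#722, person#550, dataset_name#671)]"
print(normalize_plan(sample))
# -> *Project [struct(primary_key, person, dataset_name)]
```

Note this is a best-effort normalization: it makes most plans diffable, but IDs can also affect operator ordering, so a few spurious differences may remain.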