Re: Diffing execution plans to understand an optimizer bug

Herman van Hövell tot Westerflier Tue, 08 Nov 2016 14:43:26 -0800

Replied in the ticket.

On Tue, Nov 8, 2016 at 11:36 PM, Nicholas Chammas <
[email protected]> wrote:


> SPARK-18367 <https://issues.apache.org/jira/browse/SPARK-18367>: limit()
> makes the lame walk again
>
> On Tue, Nov 8, 2016 at 5:00 PM Nicholas Chammas <
> [email protected]> wrote:
>
>> Hmm, it doesn’t seem like I can access the output of
>> df._jdf.queryExecution().hiveResultString() from Python, and until I can
>> boil the issue down a bit, I’m stuck with using Python.
>>
>> I’ll have a go at using regexes to strip some stuff from the printed
>> plans. The one that’s working for me to strip the IDs is #\d+L?.
>>
>> Nick
>> 
>>
>> On Tue, Nov 8, 2016 at 4:47 PM Reynold Xin <[email protected]> wrote:
>>
>> If you want to peek into the internals and do crazy things, it is much
>> easier to do it in Scala with df.queryExecution.
>>
>> For explain string output, you can make the plans comparable simply by
>> doing replaceAll("#\\d+", "#x")
>>
>> similar to the patch here:
>> https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217
>>
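The ID-stripping idea discussed in this thread can be sketched in Python roughly as follows. This is a minimal illustration, not code from the linked patch: the helper names normalize_plan and diff_plans are hypothetical, the pattern is the #\d+L? regex quoted earlier in the thread, and "#x" is the placeholder from the replaceAll suggestion. The plan text is assumed to be already available as an ordinary Python string.

```python
import difflib
import re

# Matches auto-generated expression IDs such as "#722" or "#722L".
EXPR_ID = re.compile(r"#\d+L?")


def normalize_plan(plan: str) -> str:
    """Replace every expression ID with the fixed placeholder "#x"."""
    return EXPR_ID.sub("#x", plan)


def diff_plans(before: str, after: str) -> str:
    """Return a unified diff of two plan strings, ignoring expression IDs."""
    return "\n".join(
        difflib.unified_diff(
            normalize_plan(before).splitlines(),
            normalize_plan(after).splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

With the IDs replaced, two plans that differ only in their generated numbering produce an empty diff, so whatever remains in the output should reflect a structural difference between the plans.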
>>
>>
>> On Tue, Nov 8, 2016 at 1:42 PM, Nicholas Chammas <
>> [email protected]> wrote:
>>
>> I’m trying to understand what I think is an optimizer bug. To do that,
>> I’d like to compare the execution plans for a certain query with and
>> without a certain change, to understand how that change is impacting the
>> plan.
>>
>> How would I do that in PySpark? I’m working with 2.0.1, but I can use
>> master if it helps.
>>
>> explain()
>> <http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain>
>> is helpful but is limited in two important ways:
>>
>>    1. It prints to screen and doesn’t offer another way to access the
>>    plan or capture it.
>>    2. The printed plan includes auto-generated IDs that make diffing
>>    impossible, e.g.
>>
>>     == Physical Plan ==
>>     *Project [struct(primary_key#722, person#550, dataset_name#671)
>>
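For the first limitation, one possible workaround is to redirect Python's standard output while the call runs. The sketch below assumes that explain() prints through Python's own sys.stdout (rather than writing directly from the JVM), which may not hold in every Spark version; the helper name capture_printed is hypothetical.

```python
import contextlib
import io


def capture_printed(fn, *args, **kwargs):
    """Call fn and return whatever it printed to Python's sys.stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        fn(*args, **kwargs)
    return buffer.getvalue()


# Hypothetical usage with a PySpark DataFrame named df
# (assumes explain() prints from the Python side):
#     plan_text = capture_printed(df.explain, True)
```

The captured string can then be saved or compared like any other text. If the output is written by the JVM instead, this redirection would capture nothing.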
>>
>> Any suggestions on what to do? Any relevant JIRAs I should follow?
>>
>> Nick
>> 
>>
>>
>>
