[
https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518460#comment-17518460
]
Danny Guinther commented on SPARK-38792:
----------------------------------------
I added another kind of dummy job that doesn't farm work out to the executors
at all and instead runs entirely on the driver while exercising most of the
code paths that a normal data flow would. Interestingly, it seems unimpacted by
the upgrade from 3.0.1 to 3.2.1 which suggests that the issue is strongly
related to passing work to the executor or to the executor doing work. I'd be
interested in ideas that might help me distinguish whether the problem is:
# Driver sending work to the executor
# Executor scheduling work
# Executor performing work
# Executor returning control to the driver
I'm not really sure how to exercise these different execution paths.
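One way to separate those buckets without new instrumentation might be Spark's own per-task metrics (web UI or the monitoring REST API). The sketch below is plain Python showing the decomposition the UI performs; field names follow the REST API's task metrics, the sample numbers are made up, and the exact scheduler-delay formula is an assumption based on how the UI derives it:

```python
# Rough decomposition of a task's wall-clock time into the four phases
# listed above, using the per-task metrics Spark exposes in the web UI
# and monitoring REST API. Scheduler delay (items 1-2: dispatch and
# scheduling) is not reported directly; the UI derives it as task
# duration minus the time the other metrics account for. Deserialize +
# run time cover item 3; result serialization + getting-result time
# cover item 4. Sample numbers below are hypothetical.

def decompose_task(task):
    accounted = (task["executorDeserializeTime"]
                 + task["executorRunTime"]
                 + task["resultSerializationTime"]
                 + task.get("gettingResultTime", 0))
    return {
        "schedulerDelay": max(task["duration"] - accounted, 0),
        "executorDeserializeTime": task["executorDeserializeTime"],
        "executorRunTime": task["executorRunTime"],
        "resultSerializationTime": task["resultSerializationTime"],
        "gettingResultTime": task.get("gettingResultTime", 0),
    }

sample = {  # hypothetical task record, times in ms
    "duration": 250,
    "executorDeserializeTime": 15,
    "executorRunTime": 200,
    "resultSerializationTime": 5,
    "gettingResultTime": 10,
}
print(decompose_task(sample)["schedulerDelay"])  # 20
```

Comparing these buckets for the same job on 3.0.1 and 3.2.1 would show whether the extra time lands in scheduling/dispatch, executor-side work, or returning results to the driver.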
> Regression in time executor takes to do work sometime after v3.0.1 ?
> --------------------------------------------------------------------
>
> Key: SPARK-38792
> URL: https://issues.apache.org/jira/browse/SPARK-38792
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Danny Guinther
> Priority: Major
> Attachments: dummy-job-job.jpg, dummy-job-query.png,
> min-time-way-up.jpg, what-s-up-with-exec-actions.jpg
>
>
> Hello!
> I'm sorry to trouble you with this, but I'm seeing a noticeable regression in
> performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I
> don't believe it is specific to my application since the upgrade from 3.0.1 to
> 3.2.1 is purely a configuration change. I'd guess it presents itself in my
> application due to the high volume of work my application does, but I could
> be mistaken.
> The gist is that the executor actions I'm running suddenly appear to take a
> lot longer on Spark 3.2.1. I don't have any ability to test
> versions between 3.0.1 and 3.2.1 because my application was previously
> blocked from upgrading beyond Spark 3.0.1 by
> https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix).
> Any ideas about what might cause this, or metrics I might try to gather to
> pinpoint the problem? I've tried a bunch of the suggestions from
> [https://spark.apache.org/docs/latest/tuning.html] to see if any of those
> help, but none of the adjustments I've tried have been fruitful. I also tried
> to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html]
> for ideas as to what might have changed to cause this behavior, but haven't
> seen anything that sticks out as being a possible source of the problem.
> I have attached a graph that shows the drastic change in time taken by
> executor actions. In the image the blue and purple lines are different kinds
> of reads using the built-in JDBC data reader and the green line is writes
> using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1
> occurred at 9AM on the graph. The graph data comes from timing blocks that
> surround only the calls to dataframe actions, so there shouldn't be anything
> specific to my application that is suddenly inflating these numbers. The
> specific actions I'm invoking are: count() (but there's some transforming and
> caching going on, so it's really more than that); first(); and write().
> The driver process does seem to be seeing more GC churn than with Spark
> 3.0.1, but I don't think that explains this behavior. The executors don't
> seem to have any problem with memory or GC and are not overutilized (our
> pipeline is very read and write heavy, less heavy on transformations, so
> executors tend to be idle while waiting for various network I/O).
>
> Thanks in advance for any help!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]