ibzib commented on a change in pull request #13743:
URL: https://github.com/apache/beam/pull/13743#discussion_r561332986
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -274,6 +472,15 @@ public static void main(String[] args) throws Exception {
"The job to run. This must correspond to a subdirectory of the
jar's BEAM-PIPELINE "
+ "directory. *Only needs to be specified if the jar contains
multiple pipelines.*")
private String baseJobName = null;
+
+ @Option(
+ name = "--spark-history-dir",
+ usage = "Spark history dir to store logs (e.g. /tmp/spark-events/)")
+ private String sparkHistoryDir = "/tmp/spark-events/";
Review comment:
This isn't used anywhere?
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
+ String sparkUser = jsc.sparkUser();
+ String sparkMaster = "";
+ String sparkExecutorID = "";
+ Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+ for (Tuple2<String, String> sparkConf : sparkConfList) {
+ if (sparkConf._1().equals("spark.master")) {
+ sparkMaster = sparkConf._2();
+ } else if (sparkConf._1().equals("spark.executor.id")) {
+ sparkExecutorID = sparkConf._2();
+ }
+ }
+ try {
+ URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+ File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+ if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+ eventLoggingListener =
+ new EventLoggingListener(
+ jobId,
+ new scala.Option<String>() {
+ @Override
+ public boolean isEmpty() {
+ return false;
+ }
+
+ @Override
+ public String get() {
+ return jobName;
+ }
+
+ @Override
+ public Object productElement(int i) {
+ return null;
+ }
+
+ @Override
+ public int productArity() {
+ return 0;
+ }
+
+ @Override
+ public boolean canEqual(Object o) {
+ return false;
+ }
+ },
+ eventLogDirectory,
+ jsc.getConf(),
+ jsc.hadoopConfiguration());
+ } else {
+ eventLoggingListener = null;
Review comment:
We should throw an exception if the directory is missing, instead of silently failing.
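Something like this sketch would make the failure explicit (the exception type and message wording here are only placeholders, reusing the `eventLogDirectoryFile`/`pipelineOptions` variables from the code above):
```java
// Sketch only: fail fast instead of leaving eventLoggingListener null.
if (!eventLogDirectoryFile.exists() || !eventLogDirectoryFile.isDirectory()) {
  throw new IllegalArgumentException(
      "Spark history dir " + pipelineOptions.getSparkHistoryDir()
          + " does not exist or is not a directory.");
}
```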
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
##########
@@ -34,6 +34,12 @@
*/
public interface SparkPipelineOptions extends SparkCommonPipelineOptions {
+ @Description("The directory to save Spark History Server logs")
+ @Default.String("/tmp/spark-events/")
Review comment:
Should we set `spark.eventLog.dir` in the Spark conf? Or does that not
matter?
https://github.com/apache/beam/blob/d1c8c241d57293b1fe9baa39f3b8e807b3a45d3a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java#L88
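If it does matter, a minimal sketch of wiring it through in `SparkContextFactory` might look like the following (assuming the `getSparkHistoryDir()` option added in this PR and the local `conf`/`options` names in that factory method):
```java
// Sketch only: keep Spark's own event log settings in sync with the pipeline option.
conf.set("spark.eventLog.enabled", "true");
conf.set("spark.eventLog.dir", options.getSparkHistoryDir());
```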
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
+ String sparkUser = jsc.sparkUser();
+ String sparkMaster = "";
+ String sparkExecutorID = "";
+ Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+ for (Tuple2<String, String> sparkConf : sparkConfList) {
+ if (sparkConf._1().equals("spark.master")) {
+ sparkMaster = sparkConf._2();
+ } else if (sparkConf._1().equals("spark.executor.id")) {
+ sparkExecutorID = sparkConf._2();
+ }
+ }
+ try {
+ URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+ File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+ if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+ eventLoggingListener =
+ new EventLoggingListener(
+ jobId,
+ new scala.Option<String>() {
+ @Override
+ public boolean isEmpty() {
+ return false;
+ }
+
+ @Override
+ public String get() {
+ return jobName;
+ }
+
+ @Override
+ public Object productElement(int i) {
+ return null;
+ }
+
+ @Override
+ public int productArity() {
+ return 0;
+ }
+
+ @Override
+ public boolean canEqual(Object o) {
+ return false;
+ }
+ },
+ eventLogDirectory,
+ jsc.getConf(),
+ jsc.hadoopConfiguration());
+ } else {
+ eventLoggingListener = null;
+ }
+ } catch (URISyntaxException e) {
+ e.printStackTrace();
+ eventLoggingListener = null;
+ }
+ if (eventLoggingListener != null) {
Review comment:
If the user intends to use history logging, but we fail to set it up, we
should always throw an exception to the user instead of silently failing and
doing nothing.
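For example, the catch block could rethrow instead of swallowing the error (sketch only; exception type and message are placeholders):
```java
} catch (URISyntaxException e) {
  // Sketch only: surface the setup failure to the user instead of printing a stack trace.
  throw new RuntimeException(
      "Failed to set up Spark event logging for " + pipelineOptions.getSparkHistoryDir(), e);
}
```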
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
+ String sparkUser = jsc.sparkUser();
+ String sparkMaster = "";
+ String sparkExecutorID = "";
+ Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+ for (Tuple2<String, String> sparkConf : sparkConfList) {
+ if (sparkConf._1().equals("spark.master")) {
+ sparkMaster = sparkConf._2();
+ } else if (sparkConf._1().equals("spark.executor.id")) {
+ sparkExecutorID = sparkConf._2();
+ }
+ }
+ try {
+ URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
Review comment:
Why make this a URI first? Wouldn't it be easier to just do `new
File(pipelineOptions.getSparkHistoryDir())`?
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
+ String sparkUser = jsc.sparkUser();
+ String sparkMaster = "";
+ String sparkExecutorID = "";
+ Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+ for (Tuple2<String, String> sparkConf : sparkConfList) {
+ if (sparkConf._1().equals("spark.master")) {
+ sparkMaster = sparkConf._2();
+ } else if (sparkConf._1().equals("spark.executor.id")) {
+ sparkExecutorID = sparkConf._2();
Review comment:
What if there are multiple executors?
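For example, a sketch that collects every id instead of keeping only the last match (whether multiple ids actually appear in the driver-side conf is exactly the question):
```java
// Sketch only: keep all executor ids rather than overwriting a single field.
java.util.List<String> sparkExecutorIds = new java.util.ArrayList<>();
for (Tuple2<String, String> sparkConf : jsc.getConf().getAll()) {
  if (sparkConf._1().equals("spark.executor.id")) {
    sparkExecutorIds.add(sparkConf._2());
  }
}
```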
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
+ String sparkUser = jsc.sparkUser();
+ String sparkMaster = "";
+ String sparkExecutorID = "";
+ Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+ for (Tuple2<String, String> sparkConf : sparkConfList) {
+ if (sparkConf._1().equals("spark.master")) {
+ sparkMaster = sparkConf._2();
+ } else if (sparkConf._1().equals("spark.executor.id")) {
+ sparkExecutorID = sparkConf._2();
+ }
+ }
+ try {
+ URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+ File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+ if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+ eventLoggingListener =
+ new EventLoggingListener(
+ jobId,
+ new scala.Option<String>() {
Review comment:
There has to be a way to create an `Option` without all this
boilerplate. I'm not too familiar with Scala, but I think `Some` might be
better here? https://www.scala-lang.org/api/current/scala/Some.html
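For instance, either of these should work from Java without the anonymous subclass (assuming standard Scala/Java interop; I haven't tried it in this code):
```java
// Sketch only: build the Option directly.
scala.Option<String> appName = scala.Option.apply(jobName); // None if jobName is null
// or, if jobName is known to be non-null:
scala.Option<String> appName2 = new scala.Some<>(jobName);
```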
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
"Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
pipelineOptions.getFilesToStage().size());
LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
PortablePipelineResult result;
final JavaSparkContext jsc =
SparkContextFactory.getSparkContext(pipelineOptions);
+ EventLoggingListener eventLoggingListener;
+ String jobId = jobInfo.jobId();
+ String jobName = jobInfo.jobName();
+ Long startTime = jsc.startTime();
Review comment:
We may reuse the Spark context, in which case `jsc.startTime()` might
not be an accurate measure of the application start time.
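If what we want is the start of this particular run, a plain wall-clock capture taken right before launching the pipeline would be independent of context reuse (sketch; the variable name is arbitrary):
```java
// Sketch only: record when this pipeline run begins, not when the (possibly reused) context started.
long pipelineStartTimeMillis = System.currentTimeMillis();
```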
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -213,6 +299,118 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
result);
metricsPusher.start();
+ if (eventLoggingListener != null) {
+ HashMap<String, String> driverLogs = new HashMap<String, String>();
+ MetricResults metricResults = result.metrics();
+ for (MetricResult<DistributionResult> distributionResultMetricResult :
Review comment:
What about the other metric types (counters and gauges)?
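They can be queried the same way as distributions, e.g. (sketch only; this reuses the `metricResults`/`driverLogs` variables from the surrounding code and the `org.apache.beam.sdk.metrics` types):
```java
// Sketch only: include counters and gauges alongside distributions.
MetricQueryResults allMetrics = metricResults.queryMetrics(MetricsFilter.builder().build());
for (MetricResult<Long> counter : allMetrics.getCounters()) {
  driverLogs.put(counter.getName().getName(), String.valueOf(counter.getAttempted()));
}
for (MetricResult<GaugeResult> gauge : allMetrics.getGauges()) {
  driverLogs.put(gauge.getName().getName(), gauge.getAttempted().toString());
}
```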
##########
File path: runners/spark/job-server/build.gradle
##########
@@ -73,6 +73,8 @@ runShadow {
args += ["--clean-artifacts-per-job=${project.property('cleanArtifactsPerJob')}"]
if (project.hasProperty('sparkMasterUrl'))
args += ["--spark-master-url=${project.property('sparkMasterUrl')}"]
+ if (project.hasProperty('sparkHistoryDir'))
Review comment:
I don't think we should add `sparkHistoryDir` as an argument to the job
server.
(I don't think sparkMasterUrl is really necessary either, since it was
already a pipeline option. But it's probably too late to remove it now.)
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
##########
@@ -34,6 +34,12 @@
*/
public interface SparkPipelineOptions extends SparkCommonPipelineOptions {
+ @Description("The directory to save Spark History Server logs")
+ @Default.String("/tmp/spark-events/")
Review comment:
Perhaps we should also add an option like `isEventLogEnabled` and make
it `false` by default, like Spark's `spark.eventLog.enabled` property.
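A sketch of what that could look like in `SparkPipelineOptions` (the name and wording are only a suggestion, mirroring Spark's `spark.eventLog.enabled`):
```java
@Description("Enable writing Spark event logs for the Spark History Server")
@Default.Boolean(false)
boolean getEventLogEnabled();

void setEventLogEnabled(boolean eventLogEnabled);
```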
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]