ibzib commented on a change in pull request #13743:
URL: https://github.com/apache/beam/pull/13743#discussion_r561332986



##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -274,6 +472,15 @@ public static void main(String[] args) throws Exception {
             "The job to run. This must correspond to a subdirectory of the 
jar's BEAM-PIPELINE "
                 + "directory. *Only needs to be specified if the jar contains 
multiple pipelines.*")
     private String baseJobName = null;
+
+    @Option(
+        name = "--spark-history-dir",
+        usage = "Spark history dir to store logs (e.g. /tmp/spark-events/)")
+    private String sparkHistoryDir = "/tmp/spark-events/";

Review comment:
       This isn't used anywhere?

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();
+    String sparkUser = jsc.sparkUser();
+    String sparkMaster = "";
+    String sparkExecutorID = "";
+    Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+    for (Tuple2<String, String> sparkConf : sparkConfList) {
+      if (sparkConf._1().equals("spark.master")) {
+        sparkMaster = sparkConf._2();
+      } else if (sparkConf._1().equals("spark.executor.id")) {
+        sparkExecutorID = sparkConf._2();
+      }
+    }
+    try {
+      URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+      File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+      if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+        eventLoggingListener =
+            new EventLoggingListener(
+                jobId,
+                new scala.Option<String>() {
+                  @Override
+                  public boolean isEmpty() {
+                    return false;
+                  }
+
+                  @Override
+                  public String get() {
+                    return jobName;
+                  }
+
+                  @Override
+                  public Object productElement(int i) {
+                    return null;
+                  }
+
+                  @Override
+                  public int productArity() {
+                    return 0;
+                  }
+
+                  @Override
+                  public boolean canEqual(Object o) {
+                    return false;
+                  }
+                },
+                eventLogDirectory,
+                jsc.getConf(),
+                jsc.hadoopConfiguration());
+      } else {
+        eventLoggingListener = null;

Review comment:
       We should throw an exception if the directory is missing, instead of silently failing.
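   A minimal sketch of what that could look like, keeping the existing `eventLogDirectoryFile` check (the exception type here is just an illustration):

```java
// Sketch only: fail fast when the configured history dir is unusable, rather
// than leaving eventLoggingListener null and silently skipping event logging.
File eventLogDirectoryFile = new File(pipelineOptions.getSparkHistoryDir());
if (!eventLogDirectoryFile.exists() || !eventLogDirectoryFile.isDirectory()) {
  throw new RuntimeException(
      String.format(
          "Spark history dir %s does not exist or is not a directory.",
          eventLogDirectoryFile.getAbsolutePath()));
}
```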

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
##########
@@ -34,6 +34,12 @@
  */
 public interface SparkPipelineOptions extends SparkCommonPipelineOptions {
 
+  @Description("The directory to save Spark History Server logs")
+  @Default.String("/tmp/spark-events/")

Review comment:
       Should we set `spark.eventLog.dir` in the Spark conf? Or does that not 
matter? 
https://github.com/apache/beam/blob/d1c8c241d57293b1fe9baa39f3b8e807b3a45d3a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java#L88
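   
   If it does matter, a hedged sketch of the wiring (assuming it would live next to the linked `SparkContextFactory` code; `options` stands for the `SparkPipelineOptions` instance, and the keys are Spark's standard `spark.eventLog.*` properties):

```java
// Sketch: mirror the pipeline option into the SparkConf so Spark's built-in
// event logging agrees with where the history server will look for logs.
SparkConf conf = new SparkConf();
conf.setIfMissing("spark.eventLog.enabled", "true");
conf.setIfMissing("spark.eventLog.dir", options.getSparkHistoryDir());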

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();
+    String sparkUser = jsc.sparkUser();
+    String sparkMaster = "";
+    String sparkExecutorID = "";
+    Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+    for (Tuple2<String, String> sparkConf : sparkConfList) {
+      if (sparkConf._1().equals("spark.master")) {
+        sparkMaster = sparkConf._2();
+      } else if (sparkConf._1().equals("spark.executor.id")) {
+        sparkExecutorID = sparkConf._2();
+      }
+    }
+    try {
+      URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+      File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+      if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+        eventLoggingListener =
+            new EventLoggingListener(
+                jobId,
+                new scala.Option<String>() {
+                  @Override
+                  public boolean isEmpty() {
+                    return false;
+                  }
+
+                  @Override
+                  public String get() {
+                    return jobName;
+                  }
+
+                  @Override
+                  public Object productElement(int i) {
+                    return null;
+                  }
+
+                  @Override
+                  public int productArity() {
+                    return 0;
+                  }
+
+                  @Override
+                  public boolean canEqual(Object o) {
+                    return false;
+                  }
+                },
+                eventLogDirectory,
+                jsc.getConf(),
+                jsc.hadoopConfiguration());
+      } else {
+        eventLoggingListener = null;
+      }
+    } catch (URISyntaxException e) {
+      e.printStackTrace();
+      eventLoggingListener = null;
+    }
+    if (eventLoggingListener != null) {

Review comment:
       If the user intends to use history logging, but we fail to set it up, we 
should always throw an exception to the user instead of silently failing and 
doing nothing. 
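   Roughly (a sketch, with `RuntimeException` standing in for whatever exception type fits Beam's conventions):

```java
try {
  URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
  // ... build the EventLoggingListener ...
} catch (URISyntaxException e) {
  // Surface the misconfiguration instead of printStackTrace() + a null listener.
  throw new RuntimeException(
      "Invalid sparkHistoryDir: " + pipelineOptions.getSparkHistoryDir(), e);
}
```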

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();
+    String sparkUser = jsc.sparkUser();
+    String sparkMaster = "";
+    String sparkExecutorID = "";
+    Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+    for (Tuple2<String, String> sparkConf : sparkConfList) {
+      if (sparkConf._1().equals("spark.master")) {
+        sparkMaster = sparkConf._2();
+      } else if (sparkConf._1().equals("spark.executor.id")) {
+        sparkExecutorID = sparkConf._2();
+      }
+    }
+    try {
+      URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());

Review comment:
       Why make this a URI first? Wouldn't it be easier to just do `new File(pipelineOptions.getSparkHistoryDir())`?
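   
   i.e. something like (sketch; `EventLoggingListener` still takes a `URI`, which `File#toURI()` can supply):

```java
File eventLogDirectoryFile = new File(pipelineOptions.getSparkHistoryDir());
if (eventLogDirectoryFile.isDirectory()) {
  // toURI() gives the URI that EventLoggingListener expects, no parsing needed.
  URI eventLogDirectory = eventLogDirectoryFile.toURI();
  // ...
}
```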

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();
+    String sparkUser = jsc.sparkUser();
+    String sparkMaster = "";
+    String sparkExecutorID = "";
+    Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+    for (Tuple2<String, String> sparkConf : sparkConfList) {
+      if (sparkConf._1().equals("spark.master")) {
+        sparkMaster = sparkConf._2();
+      } else if (sparkConf._1().equals("spark.executor.id")) {
+        sparkExecutorID = sparkConf._2();

Review comment:
       What if there are multiple executors?

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();
+    String sparkUser = jsc.sparkUser();
+    String sparkMaster = "";
+    String sparkExecutorID = "";
+    Tuple2<String, String>[] sparkConfList = jsc.getConf().getAll();
+    for (Tuple2<String, String> sparkConf : sparkConfList) {
+      if (sparkConf._1().equals("spark.master")) {
+        sparkMaster = sparkConf._2();
+      } else if (sparkConf._1().equals("spark.executor.id")) {
+        sparkExecutorID = sparkConf._2();
+      }
+    }
+    try {
+      URI eventLogDirectory = new URI(pipelineOptions.getSparkHistoryDir());
+      File eventLogDirectoryFile = new File(eventLogDirectory.getPath());
+      if (eventLogDirectoryFile.exists() && eventLogDirectoryFile.isDirectory()) {
+        eventLoggingListener =
+            new EventLoggingListener(
+                jobId,
+                new scala.Option<String>() {

Review comment:
       There has to be a way to create an `Option` without all this 
boilerplate. I'm not too familiar with Scala, but I think `Some` might be 
better here? https://www.scala-lang.org/api/current/scala/Some.html
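   
   For example, either of these should work from Java (untested sketch):

```java
// Some(jobName) directly, or Option.apply, which also maps null to None.
scala.Option<String> appName = new scala.Some<>(jobName);
scala.Option<String> appNameOrNone = scala.Option.apply(jobName);
```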

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -123,10 +140,79 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
         "Will stage {} files. (Enable logging at DEBUG level to see which files will be staged.)",
         pipelineOptions.getFilesToStage().size());
     LOG.debug("Staging files: {}", pipelineOptions.getFilesToStage());
-
     PortablePipelineResult result;
     final JavaSparkContext jsc = SparkContextFactory.getSparkContext(pipelineOptions);
 
+    EventLoggingListener eventLoggingListener;
+    String jobId = jobInfo.jobId();
+    String jobName = jobInfo.jobName();
+    Long startTime = jsc.startTime();

Review comment:
       We may reuse the Spark context, in which case `jsc.startTime()` might 
not be an accurate measure of the application start time.
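   
   If that's a concern, one option (just a sketch) is to capture the submission time here rather than reading it off the context:

```java
// Time at which this particular job was submitted, independent of when the
// (possibly reused) SparkContext was created.
long jobStartTime = java.time.Instant.now().toEpochMilli();
```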

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineRunner.java
##########
@@ -213,6 +299,118 @@ public PortablePipelineResult run(RunnerApi.Pipeline pipeline, JobInfo jobInfo)
             result);
     metricsPusher.start();
 
+    if (eventLoggingListener != null) {
+      HashMap<String, String> driverLogs = new HashMap<String, String>();
+      MetricResults metricResults = result.metrics();
+      for (MetricResult<DistributionResult> distributionResultMetricResult :

Review comment:
       What about the other metric types (counters and gauges)?
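   
   A sketch covering all three metric types (the `driverLogs` map and the string formatting are carried over from the PR; the exact keys/values are up to you):

```java
// Query everything once, then record counters, distributions, and gauges.
MetricQueryResults metrics =
    result.metrics().queryMetrics(MetricsFilter.builder().build());
for (MetricResult<Long> counter : metrics.getCounters()) {
  driverLogs.put(counter.getName().toString(), String.valueOf(counter.getAttempted()));
}
for (MetricResult<DistributionResult> dist : metrics.getDistributions()) {
  driverLogs.put(dist.getName().toString(), dist.getAttempted().toString());
}
for (MetricResult<GaugeResult> gauge : metrics.getGauges()) {
  driverLogs.put(gauge.getName().toString(), gauge.getAttempted().toString());
}
```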

##########
File path: runners/spark/job-server/build.gradle
##########
@@ -73,6 +73,8 @@ runShadow {
    args += ["--clean-artifacts-per-job=${project.property('cleanArtifactsPerJob')}"]
   if (project.hasProperty('sparkMasterUrl'))
     args += ["--spark-master-url=${project.property('sparkMasterUrl')}"]
+  if (project.hasProperty('sparkHistoryDir'))

Review comment:
       I don't think we should add `sparkHistoryDir` as an argument to the job 
server.
   
   (I don't think sparkMasterUrl is really necessary either, since it was 
already a pipeline option. But it's probably too late to remove it now.)

##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
##########
@@ -34,6 +34,12 @@
  */
 public interface SparkPipelineOptions extends SparkCommonPipelineOptions {
 
+  @Description("The directory to save Spark History Server logs")
+  @Default.String("/tmp/spark-events/")

Review comment:
       Perhaps we should also add an option like `isEventLogEnabled` and make 
it `false` by default, like Spark's `spark.eventLog.enabled` property. 
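   
   Something along these lines (a sketch; the option name mirrors `spark.eventLog.enabled` and is only a suggestion):

```java
@Description("Enable writing Spark event logs for the history server")
@Default.Boolean(false)
boolean isEventLogEnabled();

void setEventLogEnabled(boolean eventLogEnabled);
```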




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

