HyukjinKwon opened a new pull request #26517: [SPARK-26923][R][SQL][FOLLOW-UP] Show stderr in the exception whenever possible in RRunner
URL: https://github.com/apache/spark/pull/26517

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/commit/3725b1324f731d57dc776c256bc1a100ec9e6cd0#diff-71c2cad03f08cb5f6c70462aa4e28d3aL112. I made a mistake related to that line.

Previously:

1. The reader iterator for the R worker read some initial data eagerly during planning, i.e. before actual execution. For this case, it showed the standard error from the R worker.
2. After that, when an error happened during actual execution, stderr was not shown: https://github.com/apache/spark/commit/3725b1324f731d57dc776c256bc1a100ec9e6cd0#diff-71c2cad03f08cb5f6c70462aa4e28d3aL260

After my change (https://github.com/apache/spark/commit/3725b1324f731d57dc776c256bc1a100ec9e6cd0#diff-71c2cad03f08cb5f6c70462aa4e28d3aL112), case 1. no longer happens because the eager execution was avoided (consistent with the PySpark code path), so only the code path of case 2., which swallows stderr, remains.

This PR proposes to show stderr as in case 1. in the remaining code path as well. It is quite possible that the R worker failed during actual execution, and it is best to show the stderr from the R worker whenever possible.

### Why are the changes needed?

Without this change, the standard error from the R worker is swallowed, which makes debugging harder.

### Does this PR introduce any user-facing change?
Yes,

```R
df <- createDataFrame(list(list(n=1)))
collect(dapply(df, function(x) {
  stop("asdkjasdjkbadskjbsdajbk")
  x
}, structType("a double")))
```

**Before:**

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, 192.168.35.193, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (cranshed)
	at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:130)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:118)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
	at org.apache.spark.
```

**After:**

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.35.193, executor driver): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in computeFunc(inputData) : asdkjasdjkbadskjbsdajbk

	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:128)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegen
```

### How was this patch tested?

Manually tested, and a unit test was added.
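The general pattern the fix follows (capture the worker's standard error and attach it to the exception when the worker dies, instead of swallowing it) can be illustrated outside Spark. This is a minimal, hedged sketch in plain Python, not the actual `RRunner`/`BaseRRunner` code; the function name and message wording here are illustrative only:

```python
import subprocess
import sys

def run_worker(script: str) -> str:
    """Run a worker subprocess; on crash, surface its stderr in the exception.

    Illustrates the idea in the PR: when the worker exits unexpectedly,
    append whatever it wrote to stderr to the raised error so the user can
    see the real cause (e.g. the R-level error message) instead of only a
    generic "worker exited unexpectedly".
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        msg = "worker exited unexpectedly"
        if proc.stderr:
            # The key change: do not swallow stderr when it is available.
            msg += "\nWorker produced errors: " + proc.stderr.strip()
        raise RuntimeError(msg)
    return proc.stdout
```

Usage: a worker that fails with a message now surfaces that message to the caller.

```python
try:
    run_worker("import sys; sys.exit('asdkjasdjkbadskjbsdajbk')")
except RuntimeError as e:
    print(e)  # includes "asdkjasdjkbadskjbsdajbk" from the worker's stderr
```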
