RE: The job failed when we upgraded from spark 3.3.1 to spark3.4.1
Perhaps you also need to upgrade Scala?

Clay Stevens

From: Hanyu Huang
Sent: Wednesday, 15 November, 2023 1:15 AM
To: user@spark.apache.org
Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

Our job originally ran on Spark 3.3.1 with Apache Iceberg 1.2.0. Since we upgraded to Spark 3.4.1 and Apache Iceberg 1.3.1, jobs have started to fail frequently. When we upgraded only Iceberg, without upgrading Spark, the job did not report an error.

Detailed description: the failure occurs when we execute this function, which writes data to the Iceberg table:

    def appendToIcebergTable(targetTable: String, df: DataFrame): Unit = {
      _logger.warn(s"Append data to $targetTable")
      val (targetCols, sourceCols) = matchDFSchemaWithTargetTable(targetTable, df)
      df.createOrReplaceTempView("_temp")
      spark.sql(s"""
        INSERT INTO $targetTable ($targetCols)
        SELECT $sourceCols FROM _temp
      """)
      _logger.warn(s"Done append data to $targetTable")
      getIcebergLastAppendCountVerbose(targetTable)
    }

The error is reported as follows:

    Caused by: java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:208)
        at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:72)
        ... 191 more

Reading the Spark source code, we found that the failing assertion is here:

    case class ColumnarToRowExec(child: SparkPlan) extends ColumnarToRowTransition with CodegenSupport {
      // supportsColumnar requires to be only called on driver side, see also SPARK-37779.
      assert(Utils.isInRunningSparkTask || child.supportsColumnar)

      override def output: Seq[Attribute] = child.output

      override def outputPartitioning: Partitioning = child.outputPartitioning

      override def outputOrdering: Seq[SortOrder] = child.outputOrdering

But we cannot find the root cause, so we are seeking help from the community. If more log information is required, please let me know.

Thanks
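For context, the matchDFSchemaWithTargetTable helper called above is not shown in the post. A minimal sketch of what such a helper's core logic might look like, written as pure Scala over plain column-name Seqs so it stands alone (the function name, signature, and behaviour are all assumptions; in the real job the names would presumably come from spark.table(targetTable).schema and df.schema, and sourceCols might additionally carry casts or renames):

```scala
// Hypothetical sketch: align a DataFrame's columns with the target
// table's schema and build the column lists for
// INSERT INTO $targetTable ($targetCols) SELECT $sourceCols FROM _temp
def matchColumns(targetTableCols: Seq[String], dfCols: Seq[String]): (String, String) = {
  // case-insensitive lookup of the columns the DataFrame can supply
  val dfColSet = dfCols.map(_.toLowerCase).toSet
  // keep only target columns present in the DataFrame, in target-table order
  val common = targetTableCols.filter(c => dfColSet.contains(c.toLowerCase))
  // in this simplified sketch the two lists are identical
  (common.mkString(", "), common.mkString(", "))
}
```

This is only meant to make the INSERT statement above concrete; the real helper could differ in every detail.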
Re: Spark-submit without access to HDFS
I am not 100% sure, but I do not think this works - the driver would need access to HDFS.

What you could try (I have not tested it in your scenario, though):
- Use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html
- Host the zip file on an HTTPS server and use that URL (I would recommend against it, though, for various reasons such as reliability)

On 15.11.2023 at 22:33, Eugene Miretsky wrote:

Hey All,

We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not to the HDFS datanodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc.) to HDFS and just submit the job from the client. We tried something like this:

    PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py

The error we are getting is:

    org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

A few questions:
1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf files? Why would the client send them to the cluster? (The cluster already has all that info - this would make sense in client mode, but not in cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

--
Eugene Miretsky
Managing Partner | Badal.io | Book a meeting /w me!
mobile: 416-568-9245
email: eug...@badal.io
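Expanding on the Spark Connect suggestion: a rough sketch of what that setup could look like, assuming Spark 3.4+ on the cluster (the host name, default port, and the Scala/Spark versions in the package coordinate below are placeholders, not details from this thread). The Spark Connect server runs on a node inside the cluster, where the HDFS datanodes are reachable, so the client sends query plans over gRPC instead of uploading staging files to HDFS itself:

```shell
# On a node inside the cluster (HDFS reachable from here),
# start the Spark Connect server:
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.4.1

# On the client, connect over the Spark Connect port (15002 by default)
# instead of running spark-submit; e.g. from PySpark 3.4+:
#   spark = SparkSession.builder.remote("sc://yarn-master-host:15002").getOrCreate()
```

Whether this fits depends on whether Spark Connect's (still limited, as of 3.4) API surface covers the job; but since nothing goes through spark-submit on the client, the .sparkStaging upload that fails in the log above never happens there.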