RE: The job failed when we upgraded from Spark 3.3.1 to Spark 3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala?

Clay Stevens

From: Hanyu Huang 
Sent: Wednesday, 15 November, 2023 1:15 AM
To: user@spark.apache.org
Subject: The job failed when we upgraded from Spark 3.3.1 to Spark 3.4.1

Our job originally ran on Spark 3.3.1 with Apache Iceberg 1.2.0. Since we upgraded 
to Spark 3.4.1 and Iceberg 1.3.1, jobs have started to fail frequently. When we 
upgraded only Iceberg, without upgrading Spark, the job did not report an error.
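
For context, the dependency coordinates involved look roughly like this (a minimal 
sbt sketch, assuming a Scala 2.12 build; the Iceberg runtime artifact encodes both 
the Spark minor version and the Scala binary version, so all three have to agree):

// build.sbt (sketch, assuming Scala 2.12)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided",
  // before the upgrade: "iceberg-spark-runtime-3.3_2.12" % "1.2.0"
  "org.apache.iceberg" % "iceberg-spark-runtime-3.4_2.12" % "1.3.1"
)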


Detailed description:

When we execute this function writing data to the iceberg table:

def appendToIcebergTable(targetTable: String, df: DataFrame): Unit = {
  _logger.warn(s"Append data to $targetTable")

  // Align the DataFrame's columns with the target table's schema.
  val (targetCols, sourceCols) = matchDFSchemaWithTargetTable(targetTable, df)
  df.createOrReplaceTempView("_temp")
  spark.sql(s"""
    INSERT INTO $targetTable ($targetCols) SELECT $sourceCols FROM _temp
  """)
  _logger.warn(s"Done append data to $targetTable")
  getIcebergLastAppendCountVerbose(targetTable)
}
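
(matchDFSchemaWithTargetTable and getIcebergLastAppendCountVerbose are our own 
helpers. For readers, a minimal sketch of what the schema-matching helper does; 
this is hypothetical and the real implementation may differ:

// Hypothetical sketch: build matching column lists so the INSERT only
// references columns that exist in both the DataFrame and the table.
def matchDFSchemaWithTargetTable(targetTable: String, df: DataFrame): (String, String) = {
  val tableCols = spark.table(targetTable).schema.fieldNames.toSet
  val common = df.schema.fieldNames.filter(tableCols.contains)
  (common.mkString(", "), common.mkString(", "))
}
)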


The error is reported as follows:
Caused by: java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:72)
  ... 191 more


Reading the Spark source code, we found that the failing assertion is here:


case class ColumnarToRowExec(child: SparkPlan) extends ColumnarToRowTransition
    with CodegenSupport {
  // supportsColumnar requires to be only called on driver side, see also
  // SPARK-37779.
  assert(Utils.isInRunningSparkTask || child.supportsColumnar)

  override def output: Seq[Attribute] = child.output

  override def outputPartitioning: Partitioning = child.outputPartitioning

  override def outputOrdering: Seq[SortOrder] = child.outputOrdering

  // ...
}
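
The assertion fails when a ColumnarToRowExec node is constructed on the driver over 
a child plan whose supportsColumnar is false. One way to gather more detail is to 
inspect the physical plan for columnar transitions before the write (a diagnostic 
sketch, run on the driver against the df passed to appendToIcebergTable):

// Diagnostic sketch: print the executed plan and report every
// columnar-to-row transition along with the node feeding it.
import org.apache.spark.sql.execution.ColumnarToRowExec

val plan = df.queryExecution.executedPlan
println(plan.treeString)
plan.foreach {
  case c: ColumnarToRowExec =>
    println(s"ColumnarToRowExec over ${c.child.nodeName}, " +
      s"supportsColumnar=${c.child.supportsColumnar}")
  case _ =>
}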

We haven't been able to find the root cause, so we are seeking help from the 
community. If more log information is required, please let me know.

thanks


Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure, but I do not think this works - the driver would need access 
to HDFS.

What you could try (I have not tested it in your scenario):

- use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html
- host the zip file on an https server and use that URL (I would recommend against 
it, though, for various reasons such as reliability)

On 15.11.2023 at 22:33, Eugene Miretsky wrote:

Hey All,

We are running PySpark spark-submit from a client outside the cluster. The client 
has network connectivity only to the YARN master, not the HDFS datanodes. How can 
we submit the jobs? The idea would be to preload all the dependencies (job code, 
libraries, etc.) to HDFS and just submit the job from the client. We tried 
something like this:

PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py

The error we are getting is:

org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while waiting for 
channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could only 
be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 
2 node(s) are excluded in this operation.

A few questions:
1) What are the spark_conf.zip files? Are they the hive-site/yarn-site conf files? 
Why would the client send them to the cluster? (The cluster already has all that 
info - this would make sense in client mode, but not in cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

--
Eugene Miretsky
Managing Partner | Badal.io | Book a meeting /w me!
mobile: 416-568-9245
email: eug...@badal.io
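
To make the Spark Connect suggestion concrete: the Connect server runs inside the 
cluster (where it has HDFS access), and the client only needs network connectivity 
to the Connect endpoint, not to the datanodes. A minimal sketch using the Scala 
client (this assumes a Spark 3.5 Connect server is already running in the cluster; 
the host name is a placeholder, and 15002 is the default Connect port):

// Sketch: requires the spark-connect-client-jvm artifact on the client.
// "connect-host" is a placeholder for wherever the server is reachable.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://connect-host:15002")
  .getOrCreate()

// Planning and execution happen on the cluster side, which has HDFS
// access; the client never talks to the datanodes directly.
spark.read.parquet("hdfs://some-path/data").show()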