Re: Performance Problems Migrating to S3A Committers

2021-08-05 Thread James Yu
See this ticket: https://issues.apache.org/jira/browse/HADOOP-17201. It may help your team.
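For reference, switching a Spark 3.x job onto the S3A committers is driven by a handful of configuration keys. A minimal sketch, assuming Spark is built with the spark-hadoop-cloud module and that the "magic" committer is the one wanted (the thread does not say which committer is in use; the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: enable the S3A "magic" committer in place of the
    // classic FileOutputCommitter. Requires the spark-hadoop-cloud module.
    val spark = SparkSession.builder()
      .appName("s3a-committer-example") // name is illustrative
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
      .getOrCreate()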

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-05 Thread Sean Owen
Doesn't a persist break stages? On Thu, Aug 5, 2021, 11:40 AM Tom Graves wrote: > As Sean mentioned, it's only available at the stage level, but you said you don't > want to shuffle, so splitting into stages doesn't help you. Without more > details it seems like you could "hack" this by just

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-05 Thread Tom Graves
As Sean mentioned, it's only available at the stage level, but you said you don't want to shuffle, so splitting into stages doesn't help you. Without more details, it seems like you could "hack" this by just requesting an executor with 1 GPU (allowing 2 tasks per GPU) and 2 CPUs, and the one task would
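A minimal sketch of the "hack" described above, using the Spark 3.1+ stage-level scheduling API (ResourceProfileBuilder); the RDD, the mapped function, and the GPU discovery script path are placeholders, not from the thread:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    // Executors get 2 cores and 1 GPU; each task asks for 1 CPU and half a
    // GPU, so two tasks can run concurrently per GPU on one executor.
    val execReqs = new ExecutorResourceRequests()
      .cores(2)
      .resource("gpu", 1, "/opt/spark/getGpus.sh") // discovery script path is hypothetical

    val taskReqs = new TaskResourceRequests()
      .cpus(1)
      .resource("gpu", 0.5)

    val profile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build

    // The profile applies to the stage(s) this RDD is computed in.
    val mapped = rdd.withResources(profile).map(gpuHeavyFn) // rdd and gpuHeavyFn are placeholders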

Re: How can I transform RDD[Seq[String]] to RDD[Row]

2021-08-05 Thread Artemis User
I am not sure why you need to create an RDD first. You can create a DataFrame directly from the CSV file, for instance: spark.read.format("csv").option("header","true").schema(yourSchema).load(ftpUrl) -- ND On 8/5/21 3:14 AM, igyu wrote: val ftpUrl
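Spelled out a little, that suggestion looks like the sketch below. Only the "id" column is visible in the truncated question, so the schema here is illustrative:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Schema is illustrative; only "id" appears in the original message.
    val yourSchema = StructType(List(
      StructField("id", StringType, nullable = true)
      // ... remaining columns as they appear in the CSV
    ))

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .schema(yourSchema)
      .load(ftpUrl)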

Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-05 Thread Gourav Sengupta
Hi, we are trying to migrate some of the data lake pipelines to run in SPARK 3.x, whereas the dependent pipelines using those tables will still be running in SPARK 2.4.x for some time to come. Does anyone know of any issues that can happen: 1. when reading Parquet files written in 3.1.x in SPARK
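One setting that often comes up when older Spark versions read Parquet written by newer ones is spark.sql.parquet.writeLegacyFormat, which makes the writer emit decimals in the older fixed-length layout. Whether it applies to these pipelines is an assumption, not something the thread confirms; a minimal sketch on the 3.1.x writer side:

    // Hedged sketch: have the Spark 3.1.x writer emit the legacy Parquet
    // layout so older readers can consume it. df and the path are placeholders.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.parquet("s3a://bucket/table-path")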

Re: How can I transform RDD[Seq[String]] to RDD[Row]

2021-08-05 Thread suresh kumar pathak
Maybe this link will help you: https://stackoverflow.com/questions/41898144/convert-rddstring-to-rddrow-to-dataframe-spark-scala On Thu, Aug 5, 2021 at 12:46 PM igyu wrote: > val ftpUrl = > "ftp://test:test@ip:21/upload/test/_temporary/0/_temporary/task_2019124756_0002_m_00_0/*" > val

How can I transform RDD[Seq[String]] to RDD[Row]

2021-08-05 Thread igyu
val ftpUrl = "ftp://test:test@ip:21/upload/test/_temporary/0/_temporary/task_2019124756_0002_m_00_0/*" val rdd = spark.sparkContext.wholeTextFiles(ftpUrl) val value = rdd.map(_._2).map(csv => csv.split(",").toSeq) val schemas = StructType(List( new StructField("id",
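The preview cuts off mid-schema. A minimal sketch of the full pattern being asked about, converting the split strings to Rows and then to a DataFrame with createDataFrame; field names beyond "id" are assumed for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val rdd = spark.sparkContext.wholeTextFiles(ftpUrl)
    // As in the original snippet, each whole file is treated as one CSV record.
    val rows = rdd.map(_._2).map(csv => Row.fromSeq(csv.split(",").toSeq))

    // The schema must list one StructField per CSV column; only "id" is
    // known from the truncated message.
    val schemas = StructType(List(
      StructField("id", StringType, nullable = true)
      // ... further fields, unknown from the truncated message
    ))

    val df = spark.createDataFrame(rows, schemas)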