p-powell opened a new issue, #5351:
URL: https://github.com/apache/hudi/issues/5351
Concerned about performance: how long should the following mocked-up sample
take to write to S3? There are 1,369,765 records and 308 columns, and the write
takes ~10.5 min running in a Docker container on a t2.xlarge EC2 instance using
the datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest
image. Any suggestions on how to increase performance? The sample file
generated below is just to illustrate our issue.
Steps to reproduce the behavior:
1. Start docker container
docker run -it
datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest
/bin/bash
2. Download sample file
cd /tmp
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
3. Start spark shell
/opt/spark/bin/spark-shell --packages
org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hadoop:hadoop-aws:2.7.3
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
--driver-memory 16g
4. Run the following code (replace {__bucket___} with a valid bucket):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
var df = spark.read.option("header", "true").csv("file:///tmp/yellow_tripdata_2021-01.csv");
// Just constructing a table for testing.
var cols = df.columns;
var num_cols = cols.length;
// duplicating columns to make a larger dataset
for (a <- 1 to 16; b <- 0 until num_cols) {
  val col_name = cols(b);
  val new_col_name = col_name + "_" + a;
  df = df.withColumn(new_col_name, col(col_name));
};
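As an aside, building the wide frame with a single select should be noticeably cheaper to plan than 16 x num_cols chained withColumn calls, each of which adds a layer to the logical plan; a minimal sketch using the same column-naming scheme (this only affects the test-data construction, not the Hudi write itself):

// Build all duplicated columns in one projection instead of chained withColumn calls.
val duplicated = for {
  a <- 1 to 16
  c <- cols
} yield col(c).as(c + "_" + a)
df = df.select((cols.map(col) ++ duplicated): _*)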
// all rows are going to be written to one partition
val w = Window.partitionBy(lit("A")).orderBy(lit("A"))
var df_id = df.withColumn("_id", row_number().over(w)).withColumn("partpath", lit("N"))
val tableName = "hudi_test"
val basePath = "s3a://{__bucket___}/hudi_test_table"
val starttime = System.nanoTime
df_id.write.format("hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "_id").
option(RECORDKEY_FIELD_OPT_KEY, "_id").
option(PARTITIONPATH_FIELD_OPT_KEY, "partpath").
option("hoodie.datasource.write.operation","upsert").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
val duration = (System.nanoTime - starttime) / 1e9d
print("write time:" + duration )
**Expected behavior**
Not sure whether ~10.5 min is the performance we should expect at this
instance size, or whether there are settings that would speed up the write.
**Environment Description**
* Hudi version : 0.8.0 / 0.9.0
* Spark version : 3.0.1
* Hive version :
* Hadoop version : 3.3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes