p-powell opened a new issue, #5351:
URL: https://github.com/apache/hudi/issues/5351
Concerned about performance: how long should the following mocked-up sample
take to write to S3? There are 1,369,765 records and 308 columns, and the write
takes ~10.5 min running in a Docker container on a t2.xlarge EC2 instance using
the datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest
image. Any suggestions on how to increase performance? The sample file
generated below is just to illustrate our issue.
Steps to reproduce the behavior:
1. Start docker container
docker run -it
datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest
/bin/bash
2. Download sample file
cd /tmp
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
3. Start spark shell
/opt/spark/bin/spark-shell --packages
org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hadoop:hadoop-aws:2.7.3
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
--driver-memory 16g
4. Run the following code (replace {__bucket___} with a valid bucket):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
var df = spark.read.option("header", "true").csv("file:///tmp/yellow_tripdata_2021-01.csv");
// Just constructing a table for testing.
var cols = df.columns;
var num_cols = cols.length;
// duplicating columns to make a larger dataset
for (a <- 1 to 16; b <- 0 until num_cols) {
  val col_name = cols(b);
  val new_col_name = col_name + "_" + a;
  df = df.withColumn(new_col_name, col(col_name));
};
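As an aside, building the wide frame with a single select should be noticeably cheaper to plan than 16 x num_cols chained withColumn calls, each of which adds a layer to the logical plan; a minimal sketch using the same column-naming scheme (this only affects the test-data construction, not the Hudi write itself):

// Build all duplicated columns in one projection instead of chained withColumn calls.
val duplicated = for {
  a <- 1 to 16
  c <- cols
} yield col(c).as(c + "_" + a)
df = df.select((cols.map(col) ++ duplicated): _*)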
// all rows are going to be written to one partition
val w = Window.partitionBy(lit("A")).orderBy(lit("A"))
var df_id = df.withColumn("_id", row_number().over(w)).withColumn("partpath", lit("N"))
val tableName = "hudi_test"
val basePath = "s3a://{__bucket___}/hudi_test_table"
val starttime = System.nanoTime
df_id.write.format("hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "_id").
option(RECORDKEY_FIELD_OPT_KEY, "_id").
option(PARTITIONPATH_FIELD_OPT_KEY, "partpath").
option("hoodie.datasource.write.operation","upsert").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
val duration = (System.nanoTime - starttime) / 1e9d
print("write time:" + duration )
**Expected behavior**
Not sure whether ~10.5 min is the performance we should expect at this
instance size, or whether there are settings that would speed up the write.
**Environment Description**
* Hudi version : 0.8.0 / 0.9.0
* Spark version : 3.0.1
* Hive version :
* Hadoop version : 3.3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes