rubenssoto opened a new issue #1878:
URL: https://github.com/apache/hudi/issues/1878
Hi, how are you?
I'm using EMR 5.30.1, Spark 2.4.5, and Hudi 0.5.2, and my data is stored in S3.
Since today I've been trying to migrate some of our production datasets to
Apache Hudi, and I'm having problems with the first one. Could you help me, please?
It is a small dataset: 26 GB spread across 89 Parquet files. I'm reading the
data with Structured Streaming, 4 files per trigger. Writing the stream as
regular Parquet works, but with Hudi it doesn't.
These are my Hudi options. I tried with and without the shuffle options; I need
files larger than 500 MB, with a max of 1000 MB:
```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.parquet.small.file.limit': 500000000,
    'hoodie.parquet.max.file.size': 800000000,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.partition_fields': 'event_date',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://<EMR HOST>:10000',
    'hoodie.insert.shuffle.parallelism': 20,
    'hoodie.upsert.shuffle.parallelism': 20
}
```
My read and write functions:
```python
def read_parquet_stream(spark_session, read_folder_path, data_schema,
                        max_files_per_trigger):
    df = spark_session \
        .readStream \
        .option("maxFilesPerTrigger", max_files_per_trigger) \
        .schema(data_schema) \
        .parquet(read_folder_path)
    return df


def write_hudi_dataset_stream(spark_data_frame, checkpoint_location_folder,
                              write_folder_path, hudi_options):
    df_write_query = spark_data_frame \
        .writeStream \
        .options(**hudi_options) \
        .trigger(processingTime='20 seconds') \
        .outputMode('append') \
        .format('hudi') \
        .option("checkpointLocation", checkpoint_location_folder) \
        .start(write_folder_path)
    df_write_query.awaitTermination()
```
I got this error:
```
Job aborted due to stage failure: Task 11 in stage 2.0 failed 4 times, most
recent failure: Lost task 11.3 in stage 2.0 (TID 53,
ip-10-0-87-171.us-west-2.compute.internal, executor 9): ExecutorLostFailure
(executor 9 exited caused by one of the running tasks) Reason: Container killed
by YARN for exceeding memory limits. 6.3 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
```
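In case it helps others reproduce, a sketch of what the error message suggests: raising the YARN memory overhead for the executors at submit time. The 2g value and the script name are illustrative placeholders, not tuned recommendations; on Spark 2.4 the property is `spark.yarn.executor.memoryOverhead`.

```shell
# Illustrative spark-submit flags only; values are assumptions to be tuned.
# Gives each executor container extra off-heap headroom so YARN does not
# kill it when usage exceeds executor-memory + default overhead.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2g \
  my_hudi_streaming_job.py
```

I have not confirmed whether this alone fixes the failure for my job.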
My cluster is small, but the data is small too:
- master: 4 cores and 16 GB RAM
- core nodes: 2 nodes with 4 cores and 16 GB RAM each
If I write the stream as regular Parquet, the job takes 38 minutes to finish,
but with Hudi more than an hour and a half has passed and the job hasn't
finished yet.
Could you help me? I need to put this job in production as soon as possible.
Thank you, guys!
<img width="1680" alt="Captura de Tela 2020-07-24 às 17 26 45"
src="https://user-images.githubusercontent.com/36298331/88447482-27be5000-ce0a-11ea-889a-c5f1042fbe98.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 03 26"
src="https://user-images.githubusercontent.com/36298331/88447495-4cb2c300-ce0a-11ea-9482-7965d7646476.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 05 00"
src="https://user-images.githubusercontent.com/36298331/88447510-81267f00-ce0a-11ea-9311-38a395390d6b.png">