rubenssoto opened a new issue #1878:
URL: https://github.com/apache/hudi/issues/1878
Hi, how are you?
I'm using EMR 5.30.1, Spark 2.4.5, and Hudi 0.5.2, and my data is stored in S3.
Since today I've been trying to migrate some of our production datasets to
Apache Hudi, and I'm having problems with the first one. Could you help me, please?
It is a small dataset: 26 GB spread across 89 Parquet files. I'm reading the
data with Structured Streaming, 4 files per trigger. Writing the stream as
regular Parquet works, but with Hudi it doesn't.
These are my Hudi options. I tried with and without the shuffle options; I need
files larger than 500 MB, with a max of 1000 MB:
```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.parquet.small.file.limit': 500000000,
    'hoodie.parquet.max.file.size': 800000000,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.partition_fields': 'event_date',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://<EMR HOST>:10000',
    'hoodie.insert.shuffle.parallelism': 20,
    'hoodie.upsert.shuffle.parallelism': 20
}
```
My read and write functions:
```python
def read_parquet_stream(spark_session, read_folder_path, data_schema,
                        max_files_per_trigger):
    df = spark_session \
        .readStream \
        .option("maxFilesPerTrigger", max_files_per_trigger) \
        .schema(data_schema) \
        .parquet(read_folder_path)
    return df


def write_hudi_dataset_stream(spark_data_frame, checkpoint_location_folder,
                              write_folder_path, hudi_options):
    df_write_query = spark_data_frame \
        .writeStream \
        .options(**hudi_options) \
        .trigger(processingTime='20 seconds') \
        .outputMode('append') \
        .format('hudi') \
        .option("checkpointLocation", checkpoint_location_folder) \
        .start(write_folder_path)
    df_write_query.awaitTermination()
```
I got this error:
```
Job aborted due to stage failure: Task 11 in stage 2.0 failed 4 times, most
recent failure: Lost task 11.3 in stage 2.0 (TID 53,
ip-10-0-87-171.us-west-2.compute.internal, executor 9): ExecutorLostFailure
(executor 9 exited caused by one of the running tasks) Reason: Container killed
by YARN for exceeding memory limits. 6.3 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
```
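In case it helps others reproduce, a sketch of what the error message suggests: raising the YARN memory overhead for the executors at submit time. The 2g value and the script name are illustrative placeholders, not tuned recommendations; on Spark 2.4 the property is `spark.yarn.executor.memoryOverhead`.

```shell
# Illustrative spark-submit flags only; values are assumptions to be tuned.
# Gives each executor container extra off-heap headroom so YARN does not
# kill it when usage exceeds executor-memory + default overhead.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2g \
  my_hudi_streaming_job.py
```

I have not confirmed whether this alone fixes the failure for my job.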
My cluster is small, but the data is small too:
- master: 4 cores and 16 GB RAM
- core nodes: 2 nodes with 4 cores and 16 GB RAM each
If I write the stream as regular Parquet, the job takes 38 minutes to finish,
but with Hudi more than an hour and a half has passed and the job hasn't
finished yet.
Could you help me? I need to put this job in production as soon as possible.
Thank you, guys!
<img width="1680" alt="Captura de Tela 2020-07-24 às 17 26 45"
src="https://user-images.githubusercontent.com/36298331/88447482-27be5000-ce0a-11ea-889a-c5f1042fbe98.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 03 26"
src="https://user-images.githubusercontent.com/36298331/88447495-4cb2c300-ce0a-11ea-9482-7965d7646476.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 05 00"
src="https://user-images.githubusercontent.com/36298331/88447510-81267f00-ce0a-11ea-9311-38a395390d6b.png">