bwu2 opened a new issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328
 
 
   **Describe the problem you faced**
   When we upsert data into Hudi, we're finding that the job just hangs in some 
cases. Specifically, we have an ETL pipeline where we re-ingest a lot of data 
(i.e. we upsert data that already exists in the Hudi table). When the 
proportion of data that is not new is very high, the Hudi spark job seems to 
hang before writing out the updated table.
   
   Note that this currently affects 2 of the 80 tables in our ETL pipeline and 
the rest run fine. 
   
   **To Reproduce**
   See gist at: https://gist.github.com/bwu2/89f98e0926374f71c80e4b2fa5089f18
   
   The code there creates a Hudi table with 4m rows. It then upserts another 4m 
rows, 3.5m of which are the same as the original 4m.
   
   Note that the bulk insert parallelism of the initial load is deliberately 
set to 1 to ensure we avoid lots of small files.
   
   Running this code on an EMR cluster (either interactively in a PySpark shell 
or via spark-submit) causes the upsert job to never finish. It gets stuck in 
the Spark job with the description (from the Spark history server) 
`count at HoodieSparkSqlWriter.scala:255`, after the stage `mapToPair at 
HoodieWriteClient.java:492` and before/during the stage `count at 
HoodieSparkSqlWriter.scala:255`.
   
   For a table this small, the number of cores, memory, executors and instance 
type shouldn't matter, but we have varied these too with no success.
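
For readers without access to the gist, here is a minimal sketch of the write options a reproduction like this might use. It is not the gist's code: the table name and the `id`, `part` and `ts` column names are hypothetical placeholders, and the choice of option keys is an assumption about how the initial load and the upsert were configured.

```python
def hudi_write_options(table_name, operation, parallelism=None):
    """Build Hudi datasource write options for a bulk_insert or an upsert.

    The column names below are hypothetical; substitute the real record key,
    partition path and precombine fields of the table.
    """
    if operation not in ("bulk_insert", "upsert"):
        raise ValueError("operation must be 'bulk_insert' or 'upsert'")

    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "id",        # hypothetical column
        "hoodie.datasource.write.partitionpath.field": "part",  # hypothetical column
        "hoodie.datasource.write.precombine.field": "ts",       # hypothetical column
    }

    # Only override the shuffle parallelism when asked to, so the upsert
    # keeps Hudi's default unless stated otherwise.
    if parallelism is not None:
        key = (
            "hoodie.bulkinsert.shuffle.parallelism"
            if operation == "bulk_insert"
            else "hoodie.upsert.shuffle.parallelism"
        )
        options[key] = str(parallelism)
    return options


# Initial load with parallelism 1 (to avoid many small files), then an upsert.
initial_load = hudi_write_options("example_table", "bulk_insert", parallelism=1)
upsert = hudi_write_options("example_table", "upsert")

# With a SparkSession and the Hudi bundle on the classpath, the writes would
# look roughly like this (base_path is a hypothetical S3 location):
#   df_initial.write.format("org.apache.hudi").options(**initial_load) \
#       .mode("overwrite").save(base_path)
#   df_update.write.format("org.apache.hudi").options(**upsert) \
#       .mode("append").save(base_path)
```

The Spark writes are left as comments so the sketch runs without a cluster; only the option dictionaries are actually built.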
   
   **Expected behavior**
   Expected the upsert job to succeed and the total number of rows in the table 
to be 4.5m.
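
The 4.5m figure follows from simple key arithmetic, sketched below. The helper name is hypothetical, and the sketch assumes record keys are unique within each batch, so that an upsert replaces overlapping rows rather than adding to them.

```python
def expected_row_count(existing_rows, incoming_rows, overlapping_rows):
    """Rows in the table after an upsert, assuming unique record keys per batch."""
    if overlapping_rows > min(existing_rows, incoming_rows):
        raise ValueError("overlap cannot exceed the size of either batch")
    # Overlapping keys update existing rows; only the remainder is inserted.
    return existing_rows + incoming_rows - overlapping_rows


# 4m existing rows, 4m upserted rows, 3.5m of which share keys with the table.
total = expected_row_count(4_000_000, 4_000_000, 3_500_000)
print(total)  # 4500000
```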
   
   **Environment Description**
   Running on EMR 5.29.0 
   * Hudi version : tested on 0.5.0, 0.5.1 and latest build off master
   
   * Spark version : 2.4.4
   
   * Hive version : N/A
   
   * Hadoop version : 2.8.5 (Amazon)
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
