MikeBuh commented on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-1108731683

   Hi @codope, I did a rework and now have a more stable version running; however, I still have some things I would like to clarify. Before going into the details, please also consider the following:
   
   **Hudi Changes**
   - Recently we upgraded to Hudi 0.10.0, and all our tables have been updated accordingly
   - The table(s) in question use the BLOOM index (previously this was GLOBAL_BLOOM)
   - _hoodie.payload.ordering.field_ has been set to the same value as _hoodie.datasource.write.precombine.field_
   - _hoodie.upsert.shuffle.parallelism_ has been set to the same value as _spark.sparkContext.defaultParallelism.toString_
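   For reference, a minimal sketch of how the settings above fit together as writer options (the option keys are standard Hudi configs; the `ts` column name and the parallelism value of 18 are placeholders for illustration, not confirmed values from our jobs):

```python
# Hedged sketch: Hudi 0.10.0 writer options reflecting the settings above.
# "ts" is a placeholder precombine/ordering column, not from the original post.
default_parallelism = 18  # stands in for spark.sparkContext.defaultParallelism

hudi_options = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",                      # previously GLOBAL_BLOOM
    "hoodie.datasource.write.precombine.field": "ts",  # placeholder column
    "hoodie.payload.ordering.field": "ts",             # same value as precombine field
    "hoodie.upsert.shuffle.parallelism": str(default_parallelism),
}

# The actual write would look roughly like this (needs a SparkSession, not run here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```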
   
   **Real Time Flow**
   - We have a real-time flow consuming, processing, and persisting data to 
Hudi using Spark structured streaming
   - In the most common scenario the flow reads 1 or 2 files of Avro data, each around 25MB (compacted via NiFi)
   - This flow has been running successfully for a while, but we think performance can be improved
   - Our question at this point is whether all these resources (listed below) are needed given the small input data size
   > num-executors 3 
   > executor-cores 3 
   > executor-memory 5400m 
   > spark.driver.memoryOverhead=1024m 
   > spark.sql.shuffle.partitions=18 
   > spark.default.parallelism=18 
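   As a rough back-of-the-envelope check (our own arithmetic, not a Hudi recommendation), the current real-time settings work out as follows:

```python
# Back-of-the-envelope sizing for the real-time flow, using the figures above.
num_executors, executor_cores = 3, 3
total_cores = num_executors * executor_cores      # 9 tasks can run concurrently
shuffle_partitions = 18                           # spark.sql.shuffle.partitions
waves = shuffle_partitions / total_cores          # shuffle runs in 2 waves of tasks

input_mb = 2 * 25                                 # worst case: 2 Avro files x 25MB
mb_per_partition = input_mb / shuffle_partitions  # under 3MB per shuffle partition
```

   With under 3MB per shuffle partition, the cluster may well be oversized for this input volume, which is exactly what we would like confirmed.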
   
   **Reload Flow**
   - Apart from the real-time flow, we sometimes need to reload data in a separate flow, pausing the real-time one
   - The input data for this flow is as originally described (Avro files | 25MB each | 250 files / 6GB per day)
   - In our opinion this flow takes too long to execute given the amount of resources (listed below) and the size of the data
   - Are there any recommended parameters and/or options to look into that might drastically improve performance?
   
   > num-executors 5
   > executor-cores 5
   > executor-memory 7900m 
   > spark.driver.memoryOverhead=1020m 
   > spark.sql.shuffle.partitions=50 
   > spark.default.parallelism=50
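   For the reload flow, sizing shuffle partitions from the daily data volume using a common rule of thumb of roughly 128MB per partition (the 128MB target is our assumption, not a Hudi-documented value) lands close to the configured value of 50:

```python
# Rough shuffle-partition sizing for the reload flow, from the figures above.
daily_files, file_mb = 250, 25
daily_mb = daily_files * file_mb            # 6250MB, i.e. the ~6GB per day stated
target_partition_mb = 128                   # assumed rule-of-thumb partition size
suggested = -(-daily_mb // target_partition_mb)  # ceiling division
```

   So 50 partitions looks reasonable for the raw volume; our concern is whether the upsert/index lookup phases need different tuning.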
   
   
   Should you require any additional details, please reach out to us and we will provide them. Thanks once again; we look forward to your reply.
   
   
   

