MikeBuh commented on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-1108731683
Hi @codope, I did a rework and now have a more stable version working, but I still have some things I would like to clarify. Before going into the details, please also consider the following:
**Hudi Changes**
- Recently we upgraded to Hudi 0.10.0, and all our tables have been updated accordingly
- The table(s) in question use the BLOOM index (previously this was GLOBAL_BLOOM)
- _hoodie.payload.ordering.field_ has been set to the same value as _hoodie.datasource.write.precombine.field_
- _hoodie.upsert.shuffle.parallelism_ has been set to the same value as _spark.sparkContext.defaultParallelism.toString_
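For reference, the settings above can be sketched as a Hudi writer options map. This is a minimal illustration, not our actual job code; the table name, record key field `event_ts`, and the parallelism value are placeholders:

```python
# Sketch of the Hudi 0.10.0 writer options described above.
# "my_table" and "event_ts" are hypothetical placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.index.type": "BLOOM",                           # switched from GLOBAL_BLOOM
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.payload.ordering.field": "event_ts",            # same value as the precombine field
    "hoodie.upsert.shuffle.parallelism": "18",              # mirrors spark.sparkContext.defaultParallelism
}

# The ordering field is deliberately kept identical to the precombine field.
assert (hudi_options["hoodie.payload.ordering.field"]
        == hudi_options["hoodie.datasource.write.precombine.field"])
```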
**Real Time Flow**
- We have a real-time flow consuming, processing, and persisting data to
Hudi using Spark structured streaming
- In the most common scenario the flow reads 1 or 2 Avro files, each around 25MB (compacted via NiFi)
- This flow has been running successfully for a while, but we think its performance can be improved
- Our question at this point is whether all of these resources (listed below) are needed, given the small size of our data input
> num-executors 3
> executor-cores 3
> executor-memory 5400m
> spark.driver.memoryOverhead=1024m
> spark.sql.shuffle.partitions=18
> spark.default.parallelism=18
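To make the question concrete, here is a back-of-envelope sizing of that configuration against a typical micro-batch. The "2 files x 25MB" input figure comes from the scenario above; the per-task size is only an estimate, since shuffled data is not exactly the raw input size:

```python
# Rough sizing of the real-time flow configuration above.
# Assumption: a typical micro-batch is ~2 Avro files x ~25MB = ~50MB of input.
num_executors, executor_cores = 3, 3
total_cores = num_executors * executor_cores        # 9 concurrent task slots
shuffle_partitions = 18                             # => 2 waves of tasks per shuffle stage

input_mb = 2 * 25                                   # ~50MB per micro-batch
mb_per_partition = input_mb / shuffle_partitions    # roughly 3MB per shuffle task

print(total_cores, round(mb_per_partition, 1))
```

At roughly 3MB per shuffle task, each task is tiny, which is exactly why we suspect fewer cores and/or fewer partitions might serve this flow just as well.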
**Reload Flow**
- Apart from the real-time flow, we sometimes need to reload data in a separate flow, pausing the real-time one.
- The input data for this flow is as originally described (Avro files, ~25MB each, ~250 files / ~6GB per day)
- In our opinion this flow takes too long to execute, given the amount of resources (listed below) and the size of the data
- Is there any recommended parameter and/or option we could look into that might drastically improve performance?
> num-executors 5
> executor-cores 5
> executor-memory 7900m
> spark.driver.memoryOverhead=1020m
> spark.sql.shuffle.partitions=50
> spark.default.parallelism=50
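The same back-of-envelope arithmetic for the reload flow, using the daily figures stated above (~250 files x ~25MB). Again this is only a sizing sketch, not a measurement:

```python
# Rough sizing of the reload flow configuration above.
# From the figures stated: ~250 Avro files x ~25MB ~= 6GB per day.
files_per_day, file_mb = 250, 25
total_mb = files_per_day * file_mb                  # 6250 MB, ~6GB

num_executors, executor_cores = 5, 5
total_cores = num_executors * executor_cores        # 25 concurrent task slots
shuffle_partitions = 50                             # => 2 waves of tasks per shuffle stage

mb_per_partition = total_mb / shuffle_partitions    # 125 MB per shuffle task

print(total_cores, total_mb, mb_per_partition)
```

At ~125MB per shuffle task the partition count itself looks reasonable, so we suspect the runtime may be dominated by upsert-side work (BLOOM index lookups and file rewriting) rather than the raw input size; that is part of what we would like confirmed.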
Should you require any additional details, please reach out and we will provide them. Thanks once again; we look forward to your reply.