bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663813751
This is a Spark tuning issue in general. The slowness is due to memory
pressure and the node failures it causes. In at least one of the batches, I see
task failures (and retries) while reading from the source parquet files themselves.
As the error message suggests ("Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714."), you need to
increase spark.yarn.executor.memoryOverhead. You are running 2 executors per
machine with 8GB each, which may not leave much headroom. If you are
trying to compare a plain parquet write with Hudi, note that Hudi adds metadata
fields that enable incremental pull, indexing, and other benefits. If your
original record size is very small and comparable to the metadata overhead, and
your setup is already close to hitting the limit for the plain parquet write,
then you would need to give it more resources.
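As a rough sketch of what that looks like (the app name and memory values here are illustrative assumptions, not recommendations for your cluster), you would raise the overhead alongside the executor memory when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune to your own cluster.
val spark = SparkSession.builder()
  .appName("hudi-bootstrap")                              // hypothetical app name
  .config("spark.executor.memory", "8g")
  .config("spark.yarn.executor.memoryOverhead", "2048")   // MB of off-heap headroom; raise if YARN keeps killing containers
  .getOrCreate()
```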
On a related note, since you are using streaming to bootstrap from a fixed
source, have you considered using bulk insert or insert (for file-size
handling) in batch mode, which would sort and write the data once? See the
sketch below. With incremental inserting, Hudi tries to grow a small file
generated in the previous batch. This means it has to read the small file,
apply the new inserts, and write a newer (bigger) version of the file.
As you can see, the more iterations there are, the more repeated reads
happen. Hence, you would benefit from throwing more resources at this
migration for a potentially shorter time.
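A minimal sketch of a one-time bulk insert in batch mode, assuming the session above; the paths, table name, and key/precombine/partition fields are placeholders you would replace with your own:

```scala
import org.apache.spark.sql.SaveMode

// Read the fixed source once and write it to Hudi in a single sorted pass,
// avoiding the repeated small-file rewrites of the streaming/incremental path.
val df = spark.read.parquet("s3://source-bucket/raw-parquet/")      // hypothetical source path

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")        // or "insert" for small-file size handling
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .mode(SaveMode.Overwrite)
  .save("s3://target-bucket/hudi/my_table/")                         // hypothetical target path
```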