umehrot2 commented on issue #1371: [SUPPORT] Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing
URL: https://github.com/apache/incubator-hudi/issues/1371#issuecomment-594955784
 
 
   @vinothchandar this is exactly what I was talking about. This easily becomes 
a bottleneck, as the driver spends its time filtering the files it gets 
from `InMemoryFileIndex`, and that filtering is not distributed. My suggestion 
is that, at ingestion time, we simply return an `EmptyRelation` once 
**HoodieSparkSqlWriter** has done its job, because right now we end up creating 
a relation even at write time, via the parquet data source, which is really not 
necessary for our use case. I have been testing this internally for the past 
week.
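
   The idea can be sketched as a Spark `CreatableRelationProvider` that performs the write and then hands back a no-op relation. This is only a sketch: the exact `HoodieSparkSqlWriter.write` signature and the shape of the returned relation here are assumptions for illustration, not Hudi's actual code.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends CreatableRelationProvider {
  override def createRelation(
      sqlCtx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    // Perform the actual Hudi write (call shown schematically; the real
    // signature may differ).
    HoodieSparkSqlWriter.write(sqlCtx, mode, parameters, df)

    // Return a cheap relation that is never read from, instead of building a
    // parquet relation whose InMemoryFileIndex would force the driver to list
    // and filter every partition's files.
    new BaseRelation {
      override val sqlContext: SQLContext = sqlCtx
      override val schema: StructType = df.schema
    }
  }
}
```

   Spark's save path only requires that `createRelation` return *a* `BaseRelation`; it never scans it, so an empty stub avoids the driver-side file listing entirely.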

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
