umehrot2 commented on issue #1371: [SUPPORT] Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing
URL: https://github.com/apache/incubator-hudi/issues/1371#issuecomment-594955784

@vinothchandar this is exactly what I was talking about. This easily becomes a bottleneck, because the driver spends its time filtering the files it gets back from `InMemoryFileIndex`, and that filtering is not distributed. My suggestion is that, at ingestion time, we simply return an `EmptyRelation` once **HoodieSparkSqlWriter** has done its job, because right now we end up creating a relation even at write time using the parquet data source, which is really not necessary for our use case. I have been testing this internally for the past week.
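To make the shape of the proposal concrete, here is a minimal sketch of the idea. All types here (`Relation`, `ParquetRelation`, `EmptyRelation`, `createRelationForWrite`) are hypothetical stand-ins for the corresponding Spark/Hudi classes, not the real APIs; the point is only that the write path returns a trivial relation instead of constructing a parquet-backed one (which would trigger a driver-side file listing).

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Stand-in for Spark's BaseRelation (hypothetical, for illustration only).
interface Relation {
    List<String> schema();
    Iterator<String[]> scan(); // rows; never consumed on the write path
}

// Stand-in for the parquet-backed relation Spark builds today: constructing
// it would force a driver-side listing of every file under basePath, which
// is the bottleneck described in the comment above.
class ParquetRelation implements Relation {
    ParquetRelation(String basePath) {
        throw new UnsupportedOperationException("expensive file listing");
    }
    public List<String> schema() { return Collections.emptyList(); }
    public Iterator<String[]> scan() { return Collections.emptyIterator(); }
}

// The proposed cheap return value for the write path: no files, no listing.
class EmptyRelation implements Relation {
    public List<String> schema() { return Collections.emptyList(); }
    public Iterator<String[]> scan() { return Collections.emptyIterator(); }
}

public class WritePathSketch {
    // Sketch of the write path: perform the write, then short-circuit with
    // an EmptyRelation instead of building a ParquetRelation over the
    // just-written data.
    static Relation createRelationForWrite(String basePath) {
        // ... the actual write (HoodieSparkSqlWriter) would happen here ...
        return new EmptyRelation(); // no file index, no parquet scan
    }

    public static void main(String[] args) {
        Relation r = createRelationForWrite("s3://bucket/table");
        // The returned relation carries no schema and no rows.
        System.out.println(r.schema().isEmpty() && !r.scan().hasNext());
    }
}
```

Since the caller of a write ignores the returned relation's contents, returning an empty one avoids the listing cost entirely; the trade-off is that the write's return value can no longer be queried directly.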
