glory9211 opened a new issue, #5952: URL: https://github.com/apache/hudi/issues/5952
### Optimize DeltaStreamer S3EventsSource implementation to read files in parallel

Following this [guide](https://hudi.apache.org/docs/0.10.1/quick-start-guide/), we are reading incoming event files with Hudi DeltaStreamer's [S3EventsSource.java](https://github.com/apache/hudi/blob/6456bd3a5199d60ff55d6c576e139025a1c940c7/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java#L77), which uses Spark's default `spark.read.format("json").load()` to read the paths from the meta table populated by consuming SQS events holding the file paths.

With a huge number of incremental loads, e.g. 50,000 small JSON files, `spark.read.format().load()` reads the files in a non-parallel fashion on the driver node, which takes a lot of time.

**Feature Request**

Instead of passing an array of file paths to `spark.read.format().load()`, we can perform the following steps to make better use of Spark's parallelism:

1. Convert the file-path array to an RDD using `sc.parallelize()`
2. Create a function `def readContents(file): return file.read.content`
3. Call `my_rdd.flatMap(readContents)` to read the contents of the files in parallel

The inspiration for this solution, along with several other methods and performance benchmarks, is [here](https://joshua-robinson.medium.com/sparks-missing-parallelism-loading-large-datasets-6746906899f5).

**Documentation Update**

In case someone stumbles into the same problem of reading a large number of files while following the original guide, we should update the [Conclusion and Future Work](https://hudi.apache.org/blog/2021/08/23/s3-events-source/#conclusion-and-future-work) section to mention this solution or workaround.
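To make the three steps above concrete, here is a minimal, hedged sketch of the proposed pattern. It uses plain Python (temporary local files and a list comprehension standing in for `sc.parallelize(...).flatMap(...)`) so it runs without a Spark cluster; the helper name `read_contents` and the JSON-lines layout are illustrative assumptions, not Hudi's actual implementation.

```python
# Sketch of the proposed parallel-read pattern (hypothetical helper names).
# In Spark this would be: sc.parallelize(paths).flatMap(read_contents).collect()
# so that each executor task opens its own files, instead of the driver
# resolving and reading 50,000 small files serially.
import json
import os
import tempfile

def read_contents(path):
    # One task per file: parse every JSON-lines record in the file.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Create a few small JSON-lines files standing in for S3 event files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, f"events-{i}.json")
    with open(p, "w") as f:
        f.write(json.dumps({"file": i}) + "\n")
    paths.append(p)

# flatMap semantics: one list of records per file, flattened into one dataset.
records = [rec for path in paths for rec in read_contents(path)]
print(len(records))  # 3 records, one per file
```

In actual PySpark the list comprehension would be replaced by `sc.parallelize(paths, numSlices).flatMap(read_contents)`, which distributes the file reads across executors rather than funneling them through the driver.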
