glory9211 opened a new issue, #5952:
URL: https://github.com/apache/hudi/issues/5952

   ### Optimize the DeltaStreamer S3EventsSource implementation to read files in parallel
   
   Following this [guide](https://hudi.apache.org/docs/0.10.1/quick-start-guide/), we are trying to read incoming event files with the Hudi DeltaStreamer [S3EventsSource.java](https://github.com/apache/hudi/blob/6456bd3a5199d60ff55d6c576e139025a1c940c7/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java#L77), which uses the default _spark.read.format('json').load()_ call to read the file paths from the meta table that is populated by consuming SQS events holding those paths.
   
   When there is a huge number of incremental loads, e.g. 50,000 small JSON files, _spark.read.format().load()_ reads them in a non-parallel fashion on the driver node, which takes a lot of time.
   
   **Feature Request**
   
   Instead of passing an array of file paths to _spark.read.format().load()_, we can perform the following steps to make better use of Spark's parallelism:
   
   1. Convert the file path array to an RDD using _sc.parallelize()_
   2. Create a function such as `def readContents(file): return file.read.content`
   3. Call _my_rdd.flatMap(readContents)_ to read the contents of the files in parallel across the executors
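   The steps above can be sketched roughly as follows. This is a minimal illustration, not Hudi code: the names `read_contents` and `parallel_read` are hypothetical, the path list is assumed to be readable from every executor (on a real cluster these would be `s3a://` URIs opened through the Hadoop FileSystem API or boto3; plain `open()` is used here for simplicity), and the final `spark.read.json()` over re-serialized records is just one way to get back to a DataFrame.

   ```python
   # Sketch of reading many small JSON files via RDD parallelism instead of
   # letting the driver resolve every path inside spark.read.load().
   import json


   def read_contents(path):
       # Yield one JSON record per line of the file. On a real cluster this
       # would open s3:// paths via the Hadoop FileSystem API or boto3;
       # plain open() is enough for this local sketch.
       with open(path, "r") as f:
           for line in f:
               line = line.strip()
               if line:
                   yield json.loads(line)


   def parallel_read(spark, paths, num_partitions=None):
       # Step 1: turn the path array into an RDD so the reads are distributed.
       sc = spark.sparkContext
       rdd = sc.parallelize(paths, num_partitions or min(len(paths), 1000))
       # Steps 2-3: each executor reads the contents of its own slice of files.
       records = rdd.flatMap(read_contents)
       # Re-serialize to JSON strings so spark.read.json() can infer a schema.
       return spark.read.json(records.map(json.dumps))
   ```

   The key design point is that `sc.parallelize()` moves the per-file I/O from the single driver onto the executors, so 50,000 small reads happen concurrently instead of sequentially.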
   
   The inspiration for this solution, along with several other methods and performance benchmarks, can be found [here](https://joshua-robinson.medium.com/sparks-missing-parallelism-loading-large-datasets-6746906899f5).
   
   **Documentation Update**
   
   In case someone else stumbles into the same problem of reading a large number of files while following the original guide, we should update the [Conclusion and Future Work](https://hudi.apache.org/blog/2021/08/23/s3-events-source/#conclusion-and-future-work) section to mention this solution or workaround.

