[GitHub] [hudi] yihua commented on issue #5952: [SUPPORT] HudiDeltaStreamer S3EventSource SQS optimize for reading large number of files in parallel fashion

GitBox Tue, 28 Jun 2022 09:57:39 -0700


yihua commented on issue #5952:
URL: https://github.com/apache/hudi/issues/5952#issuecomment-1168986618


   Thanks for the feature request.
   
   The code you referenced in `S3EventsSource` converts JSON records that are 
already held in a Dataset into a DataFrame for further processing.  Do you 
mean optimizing how events are read from SQS instead (which should not 
involve any file reading)?
   ```
   Dataset<String> eventRecords = sparkSession.createDataset(
       selectPathsWithLatestSqsMessage.getLeft(), Encoders.STRING());
   return Pair.of(
       Option.of(sparkSession.read().json(eventRecords)),
       selectPathsWithLatestSqsMessage.getRight());
   ```
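
To illustrate why this step involves no file reading, here is a minimal, Spark-free sketch. It assumes the event bodies follow the S3 event notification layout (`Records[].s3.object.key`). The class `SqsEventSketch`, the method `extractObjectKeys`, and the sample bodies are hypothetical and not part of Hudi, and the regex is a simplified stand-in for a real JSON parser (it does not handle escaped quotes inside values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Hudi code.
final class SqsEventSketch {
    // Simplified stand-in for a JSON parser: matches "key": "<value>" pairs.
    private static final Pattern OBJECT_KEY =
        Pattern.compile("\"key\"\\s*:\\s*\"([^\"]*)\"");

    // Extracts S3 object keys from event bodies that are already in memory.
    // Nothing is opened or downloaded here; only the message strings are scanned.
    static List<String> extractObjectKeys(List<String> eventBodies) {
        List<String> keys = new ArrayList<>();
        for (String body : eventBodies) {
            Matcher m = OBJECT_KEY.matcher(body);
            while (m.find()) {
                keys.add(m.group(1));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Made-up sample bodies standing in for messages already pulled from SQS.
        List<String> bodies = List.of(
            "{\"Records\":[{\"s3\":{\"bucket\":{\"name\":\"b\"},"
                + "\"object\":{\"key\":\"data/a.parquet\"}}}]}",
            "{\"Records\":[{\"s3\":{\"bucket\":{\"name\":\"b\"},"
                + "\"object\":{\"key\":\"data/b.parquet\"}}}]}");
        System.out.println(extractObjectKeys(bodies));
    }
}
```

The cost of this step depends on the number and size of the event messages, not on the size of the files they point to, so any parallelism gain would have to come from how messages are fetched from SQS or from how the referenced files are read later.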
   
   Feel free to create a Jira ticket for the feature request, and I encourage 
you to put up a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
