[ https://issues.apache.org/jira/browse/HUDI-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1896:
---------------------------------
    Description:
As discussed in HUDI-1723, we need a better implementation for cloud object storage such as AWS S3 or GCS, leveraging change notifications.

Also consider [https://docs.databricks.com/spark/latest/structured-streaming/sqs.html]

We need to look into the current *DFSSource classes and see if we can add a new `DFSPathSelector` implementation that fetches new files on cloud storage after a given point in time. The timestamp-based approach used by the existing path selector largely works, but has corner cases, as mentioned in HUDI-1723.

  was:
As discussed in HUDI-1723, we need a better implementation for cloud object storage such as AWS S3 or GCS, leveraging change notifications.

Also consider https://docs.databricks.com/spark/latest/structured-streaming/sqs.html

> [UMBRELLA] Implement DeltaStreamer Source for cloud object stores
> -----------------------------------------------------------------
>
>                 Key: HUDI-1896
>                 URL: https://issues.apache.org/jira/browse/HUDI-1896
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>            Reporter: Raymond Xu
>            Priority: Critical
>
> As discussed in HUDI-1723, we need a better implementation for cloud object
> storage such as AWS S3 or GCS, leveraging change notifications.
> Also consider
> [https://docs.databricks.com/spark/latest/structured-streaming/sqs.html]
>
> We need to look into the current *DFSSource classes and see if we can add a new
> `DFSPathSelector` implementation that fetches new files on cloud storage
> after a given point in time. The timestamp-based approach used by the existing
> path selector largely works, but has corner cases, as mentioned in HUDI-1723.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
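
For illustration only, here is a minimal Python sketch of timestamp-based file selection. It is not Hudi's actual `DFSPathSelector` code; `FileInfo`, `select_new_files`, and `source_limit_bytes` are hypothetical names, and the size-limit behavior is an assumption made for the example. It shows one way a selector that checkpoints on modification time can miss a file: two files share a modification time, and a batch size limit cuts the batch between them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical listing entry: path, modification time, and size."""
    path: str
    mod_time_ms: int
    size_bytes: int


def select_new_files(
    files: List[FileInfo], checkpoint_ms: int, source_limit_bytes: int
) -> Tuple[List[FileInfo], int]:
    """Pick files modified strictly after checkpoint_ms, oldest first.

    Stops once adding another file would exceed source_limit_bytes
    (at least one file is always taken). Returns the selected files and
    the new checkpoint, which is the modification time of the last
    selected file, or the old checkpoint if nothing was selected.
    """
    candidates = sorted(
        (f for f in files if f.mod_time_ms > checkpoint_ms),
        key=lambda f: (f.mod_time_ms, f.path),
    )
    selected: List[FileInfo] = []
    total = 0
    new_checkpoint = checkpoint_ms
    for f in candidates:
        if selected and total + f.size_bytes > source_limit_bytes:
            break
        selected.append(f)
        total += f.size_bytes
        new_checkpoint = f.mod_time_ms
    return selected, new_checkpoint


# Two files share mod_time_ms=100; the limit only fits one of them.
listing = [
    FileInfo("s3://bucket/a.parquet", 100, 60),
    FileInfo("s3://bucket/b.parquet", 100, 60),
    FileInfo("s3://bucket/c.parquet", 200, 10),
]

batch1, ckpt1 = select_new_files(listing, checkpoint_ms=0, source_limit_bytes=100)
batch2, ckpt2 = select_new_files(listing, checkpoint_ms=ckpt1, source_limit_bytes=100)

seen = {f.path for f in batch1 + batch2}
missed = [f.path for f in listing if f.path not in seen]
print("missed:", missed)
```

In this sketch the second call starts strictly after the first checkpoint, so `b.parquet` is never picked up. A source driven by change notifications would receive one event per object and would not depend on comparing modification times in this way.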