DanielLeens commented on issue #10921: URL: https://github.com/apache/seatunnel/issues/10921#issuecomment-4560127588
Thanks for breaking this out into a dedicated follow-up issue.

This looks useful, but I would keep it downstream of the umbrella contract work rather than turning "dedicated knowledge sources" into one large parallel track too early.

The main reason is that Confluence, Google Drive, and SharePoint will each have different source-native version semantics, auth models, and incremental discovery behavior. If the unified `DocumentId` / `DocumentHash` / metadata contract is not settled first, it will be very easy for each source to drift into a slightly different interpretation.

So my suggestion would be:

1. treat the unified document contract as the blocking dependency;
2. narrow phase 1 to one concrete source MVP first, rather than grouping all dedicated sources together immediately;
3. use that first source to prove metadata projection, version-priority rules, and integration with the parse/chunk/embedding/lifecycle-sink path before expanding to the next sources.

If this issue continues, it would help to make the first source candidate, its versioning model, and the explicit non-goals for phase 1 clearer. That will make the proposal much easier for the community to review and split into implementable slices.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
