DanielLeens commented on issue #10921:
URL: https://github.com/apache/seatunnel/issues/10921#issuecomment-4560127588

   Thanks for breaking this out into a dedicated follow-up issue.
   
   This looks useful, but I would keep it downstream of the umbrella contract 
work rather than turning "dedicated knowledge sources" into one large parallel 
track too early.
   
   The main reason is that Confluence, Google Drive, and SharePoint will each 
have different source-native version semantics, auth models, and incremental 
discovery behavior. If the unified `DocumentId` / `DocumentHash` / metadata 
contract is not settled first, it will be very easy for each source to drift 
into a slightly different interpretation.
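   
   As a rough illustration of what such a shared contract could look like, here 
is a minimal sketch. The class `UnifiedDocument`, the `sourceType:nativeId` 
identifier format, and the choice of SHA-256 for the hash are all assumptions 
made for this example, not existing SeaTunnel APIs or agreed design decisions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical unified document contract; not an existing SeaTunnel API. */
final class UnifiedDocument {
    final String documentId;            // stable across versions of the same source object
    final String documentHash;          // changes only when the content changes
    final Map<String, String> metadata; // normalized keys shared by every source

    private UnifiedDocument(String documentId, String documentHash, Map<String, String> metadata) {
        this.documentId = documentId;
        this.documentHash = documentHash;
        this.metadata = metadata;
    }

    /** Every source builds documents through the same factory, so the rules cannot drift. */
    static UnifiedDocument of(String sourceType, String nativeId, String content,
                              Map<String, String> metadata) {
        // Assumed ID format: source type plus the source-native identifier.
        String id = sourceType + ":" + nativeId;
        // Sorted, read-only copy so metadata ordering is deterministic.
        Map<String, String> normalized = Collections.unmodifiableMap(new TreeMap<>(metadata));
        return new UnifiedDocument(id, sha256Hex(content), normalized);
    }

    /** Hex-encoded SHA-256 of the UTF-8 content (assumed hash choice). */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandated for every Java platform implementation.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

   With one factory like this, a Confluence page and a Google Drive file would 
get identifiers and hashes by the same rules, and only the source-specific 
metadata mapping would differ.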
   
   So my suggestion would be:
   
   1. treat the unified document contract as the blocking dependency;
   2. narrow phase 1 to one concrete source MVP first, rather than grouping all 
dedicated sources together immediately;
   3. use that first source to prove metadata projection, version-priority 
rules, and integration with the parse/chunk/embedding/lifecycle-sink path 
before expanding to the next sources.
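   
   To make "version-priority rules" concrete, one possible reading is sketched 
below: prefer a source-native version number when both sides have one, fall 
back to the modification timestamp, and use the content hash only as a last 
resort. The `VersionPriority` and `Snapshot` names, the field set, and the 
ordering itself are assumptions for illustration; the real rule would need to 
be defined by the contract.

```java
/** Hypothetical version-priority rule; names and ordering are illustrative only. */
final class VersionPriority {

    /** Version signals a source may or may not be able to provide. */
    static final class Snapshot {
        final Long nativeVersion;   // e.g. a page revision number; null if the source has none
        final Long modifiedEpochMs; // last-modified time in epoch millis; null if unknown
        final String contentHash;   // always present

        Snapshot(Long nativeVersion, Long modifiedEpochMs, String contentHash) {
            this.nativeVersion = nativeVersion;
            this.modifiedEpochMs = modifiedEpochMs;
            this.contentHash = contentHash;
        }
    }

    /** Returns true when the incoming snapshot should replace the stored one. */
    static boolean isNewer(Snapshot incoming, Snapshot stored) {
        // 1. The source-native version wins when both sides carry it.
        if (incoming.nativeVersion != null && stored.nativeVersion != null) {
            return incoming.nativeVersion > stored.nativeVersion;
        }
        // 2. Otherwise compare modification timestamps.
        if (incoming.modifiedEpochMs != null && stored.modifiedEpochMs != null) {
            return incoming.modifiedEpochMs > stored.modifiedEpochMs;
        }
        // 3. Last resort: any change in the content hash counts as a new version.
        return !incoming.contentHash.equals(stored.contentHash);
    }
}
```

   Proving a rule like this against one real source first would show whether 
the fallback order holds before other sources adopt it.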
   
   If this issue continues, it would help to state explicitly the first source 
candidate, its versioning model, and the non-goals for phase 1. That will 
make the proposal much easier for the community to review and split into 
implementable slices.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
