rohitsinha54 commented on PR #32662: URL: https://github.com/apache/beam/pull/32662#issuecomment-2400436239
I think we should make the lineage reported by FileIO fit within the limit by itself, without depending on the StringSet metric type to enforce a limit, as there is no clean and short way to do so (see comment).

We need to handle these cases:

- Write, sharded files: This is already handled in this PR. When the number of sharded files is greater than 100, we report the directory above (guaranteed to be common).
- Read, wildcard and large number of files under a dir: For both of these cases we can take a simple approach for now. We look at the files; if there are > 100, we look one level up. If the number of unique dirs is > 100, we report the bucket only (we avoid traversing further up to find a common ancestor under the limit of 100); if it is < 100, we report the dir paths.

This gives us accurate visibility in lineage in most cases, except when files are spread across many folders at different levels, in which case we only get to know the bucket, which is fine. On customer demand we can improve this in later releases.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
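For illustration only, here is a minimal Python sketch of the read-side fallback proposed in the comment above. It is not the code in the PR: the function names `lineage_for_read` and `bucket_of`, the `gs://` example paths, and the choice to treat exactly 100 entries as within the limit are all assumptions made for this sketch.

```python
from urllib.parse import urlsplit

LIMIT = 100  # threshold mentioned in the comment


def bucket_of(path: str) -> str:
    """Return the scheme and bucket of a path, e.g. 'gs://bucket' (hypothetical helper)."""
    parts = urlsplit(path)
    return f"{parts.scheme}://{parts.netloc}"


def lineage_for_read(files, limit=LIMIT):
    """Pick what to report as lineage for a set of files that were read.

    Hypothetical sketch of the proposal:
      1. If the files fit within the limit, report them as they are.
      2. Otherwise look one level up; if the unique parent dirs fit, report those.
      3. Otherwise report only the bucket(s), without searching further up
         for a common ancestor.
    Assumption: exactly `limit` entries counts as fitting.
    """
    unique_files = set(files)
    if len(unique_files) <= limit:
        return unique_files

    # One level up: strip the last path segment of every file.
    parent_dirs = {f.rsplit("/", 1)[0] + "/" for f in unique_files}
    if len(parent_dirs) <= limit:
        return parent_dirs

    # Too many distinct directories: fall back to the bucket only.
    return {bucket_of(f) for f in unique_files}
```

With this sketch, 150 files under a single directory collapse to that one directory, while 150 files that each sit in their own directory collapse to the bucket. Files spread across directories at different depths also end up at the bucket, which is the trade-off the comment accepts.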
