rohitsinha54 commented on PR #32662:
URL: https://github.com/apache/beam/pull/32662#issuecomment-2400436239

   I think we should make the lineage reported by FileIO fit within the limit 
by itself, without depending on the StringSet metric type to enforce a limit, 
as there is no clean and short way to do so; see comment: 
   
   We need to handle these cases:
   - Write: sharded files. This is already handled in this PR. When the number 
of sharded files is greater than 100, we report the directory above them 
(guaranteed to be common). 
   - Read: wildcards, and a large number of files under a dir. For both of 
these cases we can take a simple approach for now. We look at the files; if 
there are > 100, we look one level up. If the number of unique dirs is > 100, 
we report the bucket only (we avoid traversing further up to find a common 
ancestor under the limit of 100); if it is < 100, we report the dir paths.
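
A rough sketch of the read-side rollup described above, in Python for brevity. This is only an illustration, not the code in this PR: `LINEAGE_LIMIT`, `bucket_of`, and `rollup_for_lineage` are hypothetical names, paths are assumed to look like `scheme://bucket/dir/file`, and a count of exactly 100 is treated as fitting within the limit, which the description above leaves open.

```python
import posixpath
from typing import Iterable, List

# Hypothetical constant: the 100-entry limit discussed above.
LINEAGE_LIMIT = 100


def bucket_of(path: str) -> str:
    """Return 'scheme://bucket' for a path like 'scheme://bucket/dir/file'."""
    scheme, sep, rest = path.partition("://")
    if not sep:
        # Assumption: every path carries a scheme and a bucket.
        raise ValueError(f"expected scheme://bucket/..., got {path!r}")
    return scheme + sep + rest.split("/", 1)[0]


def rollup_for_lineage(paths: Iterable[str], limit: int = LINEAGE_LIMIT) -> List[str]:
    """Pick what to report: the files, their parent dirs, or only the buckets."""
    files = sorted(set(paths))
    if len(files) <= limit:
        return files

    # Too many files: look one level up at the unique parent directories.
    dirs = sorted({posixpath.dirname(p) for p in files})
    if len(dirs) <= limit:
        return dirs

    # Too many directories as well: report the bucket only, without
    # searching further up for a common ancestor under the limit.
    return sorted({bucket_of(p) for p in files})
```

For example, 150 files under `gs://b/d/` would be reported as the single entry `gs://b/d`, while 150 files spread over 150 different directories in the same bucket would be reported as `gs://b`.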
   
   This gives us accurate visibility in lineage in most cases. The exception 
is when files are spread across many folders at different levels, in which 
case we only get to know the bucket, which is fine. We can improve this in 
later releases if customers ask for it. 
   
   
   
   

