rohitsinha54 commented on PR #32650: URL: https://github.com/apache/beam/pull/32650#issuecomment-2400436534
> > > Can we move the size enforcement to Lineage class rather than forcing it for a particular metric type only which makes it inconsistent with others? > > > > > > Found actually this is impossible. Lineage is a wrapper of StringSet, whose actual implementation is DelegateStringSet, backed by runner's StringSetData. In Lineage class there is no visibility of the current stringset size, and even if we add track (e.g. a static int) of how many strings we put into Lineage, we do not know the current set size as we do not know if the newly added element already exist in the set > > Thank you for considering and exploring this. This seem like correct description of current implementation. We can incrementally add more to support the above need. > > Here is what I am thinking, StringSet can expose a an API size() which gives the current size of StringSet metric. This will of course be backed by DelegateStringSet and runner's StringSetData. This will allow Lineage class to get visibility into the current size. Based on the current size Lineage class enforce a limit. Generically other consumer/user who uses StringSet metric might want to know the current size of the Set. > > At large beside the two points mentioned in ([#32650 (comment)](https://github.com/apache/beam/pull/32650#discussion_r1788181568)) I believe a better design will be to have StringSet metric expose APIs and feature enabling consumers such as Lineage to use the metric for their use cases (in this case size limit) (bottom up) rather than top down wherein the use case of consumer is built in the metric itself. We want the beam's core metric to be generic. > > Also the size limit of metric is a Dataflow specific limitation based of limit of ReportWorkItem/GFE. Other runner might or might not have it. We should avoid coupling it with Beam's core metric. We discussed this offline. Here is the summary of the discussion for future. If size is exposed based on the StringSetData then the size will be for that specific bundle since one metric may have multiple metrics containers (backed by different stringsetdata). This can confuse user if exposed a api level and also breaks the abstraction which Beam as a programming model strives to provide. The current size enforcement also has this issue but since it is not exposed to user to consume. the stringsetdata limit happens to work with FileIO because currently FileIO report source and sink metrics in a loop, so it happens to write to single metrics container on a single worker. Hence the size enforcement is tied to FileIO implementation. So this is not an ideal fix but we need to solve the FileIO large file issue somehow. I will propose to drop the limit enforcement on string set completely in anyway. We will limit the number of reported files in FileIO itself to ensure we do not hit it. This will avoid FileIO implication and implementation being tied to a metric type. See comment. https://github.com/apache/beam/pull/32662#issuecomment-2400436239 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
