Re: [PR] Enforce a size limit on StringSetData [beam]

via GitHub Tue, 08 Oct 2024 11:57:34 -0700


rohitsinha54 commented on PR #32650:
URL: https://github.com/apache/beam/pull/32650#issuecomment-2400436534


   > > > Can we move the size enforcement to Lineage class rather than forcing 
it for a particular metric type only which makes it inconsistent with others?
   > > 
   > > 
   > > Found actually this is impossible. Lineage is a wrapper of StringSet, 
whose actual implementation is DelegateStringSet, backed by runner's 
StringSetData. In Lineage class there is no visibility of the current stringset 
size, and even if we add track (e.g. a static int) of how many strings we put 
into Lineage, we do not know the current set size as we do not know if the 
newly added element already exist in the set
   > 
   > Thank you for considering and exploring this. This seem like correct 
description of current implementation. We can incrementally add more to support 
the above need.
   > 
   > Here is what I am thinking, StringSet can expose a an API size() which 
gives the current size of StringSet metric. This will of course be backed by 
DelegateStringSet and runner's StringSetData. This will allow Lineage class to 
get visibility into the current size. Based on the current size Lineage class 
enforce a limit. Generically other consumer/user who uses StringSet metric 
might want to know the current size of the Set.
   > 
   > At large beside the two points mentioned in ([#32650 
(comment)](https://github.com/apache/beam/pull/32650#discussion_r1788181568)) I 
believe a better design will be to have StringSet metric expose APIs and 
feature enabling consumers such as Lineage to use the metric for their use 
cases (in this case size limit) (bottom up) rather than top down wherein the 
use case of consumer is built in the metric itself. We want the beam's core 
metric to be generic.
   > 
   > Also the size limit of metric is a Dataflow specific limitation based of 
limit of ReportWorkItem/GFE. Other runner might or might not have it. We should 
avoid coupling it with Beam's core metric.
   
   We discussed this offline. 
   
   Here is the summary of the discussion for future. 
   If size is exposed based on the StringSetData then the size will be for that 
specific bundle since  one metric may have multiple metrics containers (backed 
by different stringsetdata). This can confuse user if exposed a api level and 
also breaks the abstraction which Beam as a programming model strives to 
provide.
   
   The current size enforcement also has this issue but since it is not exposed 
to user to consume. the stringsetdata limit happens to work with FileIO because 
currently FileIO report source and sink metrics in a loop, so it happens to 
write to single metrics container on a single worker. Hence the size 
enforcement is tied to FileIO implementation. So this is not an ideal fix but 
we need to solve the FileIO large file issue somehow. 
   
   I will propose to drop the limit enforcement on string set completely in 
anyway. We will limit the number of reported files in FileIO itself to ensure 
we do not hit it. This will avoid FileIO implication and implementation being 
tied to a metric type. 
   
   See comment. 
https://github.com/apache/beam/pull/32662#issuecomment-2400436239
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Enforce a size limit on StringSetData [beam]

Reply via email to