[GitHub] [gobblin] phet commented on pull request #3667: [GOBBLIN-1806] Submit dataset summary event post commit and integrate them into GaaSObservabilityEvent

via GitHub Wed, 12 Apr 2023 09:32:56 -0700


phet commented on PR #3667:
URL: https://github.com/apache/gobblin/pull/3667#issuecomment-1505582228


   > Oh, I meant that I'll aggregate on the task level but let the client 
handle aggregation on the dataset level, since it is rare to have a single 
pipeline manage say tens of thousands of datasets.
   
   yes, I agree, but think of it less in terms of efficiency (how many ops the 
aggregation takes) than of convenience: even if there are only 5 datasets, 
every downstream analysis that considers the job-level total still needs a 
piece of code to sum them per event.  I believe that describes most analyses...
   
   so although I generally recommend against de-normalized data structures, I 
see the ready-to-use convenience outweighing the design principle here.  
consider the simple problem of counting, by user, who is copying > 1TB in a 
job.  it's trivially simple to filter by bytes... but only so long as there is 
a job-level sum.  if the event has only task-level aggregation, the friction 
increases considerably, whether in SQL or Kusto.
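
as a rough sketch of that friction (the event shapes and field names below are hypothetical, not the actual GaaSObservabilityEvent schema), here is the "who's copying > 1TB in a job" count written both ways. with a job-level sum it is a plain filter; without one, every consumer has to re-derive the total from the per-dataset summaries first:

```python
from collections import defaultdict

GB = 10**9
TB = 10**12

# Hypothetical events that carry a de-normalized job-level total.
events_with_job_total = [
    {"job": "j1", "user": "alice", "bytes_total": 3 * TB},
    {"job": "j2", "user": "bob", "bytes_total": 500 * GB},
    {"job": "j3", "user": "alice", "bytes_total": 700 * GB},
    {"job": "j4", "user": "bob", "bytes_total": 1500 * GB},
]

# The same hypothetical jobs, but with per-dataset summaries only.
events_per_dataset_only = [
    {"job": "j1", "user": "alice", "datasets": [{"bytes": 2 * TB}, {"bytes": 1 * TB}]},
    {"job": "j2", "user": "bob", "datasets": [{"bytes": 500 * GB}]},
    {"job": "j3", "user": "alice", "datasets": [{"bytes": 400 * GB}, {"bytes": 300 * GB}]},
    {"job": "j4", "user": "bob", "datasets": [{"bytes": 1500 * GB}]},
]


def large_jobs_by_user_with_total(events, threshold=TB):
    """Count jobs over the threshold per user: a plain filter on one field."""
    counts = defaultdict(int)
    for event in events:
        if event["bytes_total"] > threshold:
            counts[event["user"]] += 1
    return dict(counts)


def large_jobs_by_user_without_total(events, threshold=TB):
    """Same question, but each consumer must sum the datasets per event first."""
    counts = defaultdict(int)
    for event in events:
        job_total = sum(dataset["bytes"] for dataset in event["datasets"])
        if job_total > threshold:
            counts[event["user"]] += 1
    return dict(counts)
```

both functions answer the same question; the second just repeats the summation in every analysis, which in SQL or Kusto would typically mean an extra expand-and-regroup step before the filter.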


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
