phet commented on PR #3667:
URL: https://github.com/apache/gobblin/pull/3667#issuecomment-1505582228

   > Oh, I meant that I'll aggregate on the task level but let the client 
handle aggregation on the dataset level, since it is rare to have a single 
pipeline manage say tens of thousands of datasets.
   
   yes, I agree, but think of it less in terms of efficiency and how many ops 
the aggregation takes, and more in terms of convenience: even if there are only 
5 datasets, it still requires a piece of code to total them per-event, within 
every downstream analysis that considers the job-level total.  I believe that 
describes most analyses...
   
   so although I generally recommend against de-normalized data structures, I 
see the ready-to-use convenience outweighing the design principle here.  
consider the simple problem of counting, by user, who's copying > 1TB in a job.  
it's trivially simple to filter by bytes... but only so long as there is a 
job-level sum.  if the event has only task-level aggregation, the friction 
increases considerably, whether in SQL or Kusto.
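
   to make the friction concrete, here is a rough Python sketch of that 
"count by user" analysis written both ways.  the event shape and the field 
names (`user`, `job_bytes`, `per_dataset_bytes`) are made up for illustration 
and are not the actual event schema from this PR.

```python
from collections import Counter

TB = 10**12  # decimal terabyte, assumed here only for the example

# Hypothetical events: one per job, carrying both a job-level total and
# per-dataset totals. Field names are illustrative, not Gobblin's schema.
events = [
    {"user": "alice", "job_bytes": 3 * TB // 2,
     "per_dataset_bytes": {"ds1": TB, "ds2": TB // 2}},
    {"user": "bob", "job_bytes": TB // 5,
     "per_dataset_bytes": {"ds1": TB // 5}},
    {"user": "alice", "job_bytes": 2 * TB,
     "per_dataset_bytes": {"ds3": 2 * TB}},
]


def count_with_job_total(events):
    """Job-level sum present: the analysis is a plain filter on one field."""
    return Counter(e["user"] for e in events if e["job_bytes"] > TB)


def count_without_job_total(events):
    """No job-level sum: every analysis must first total the datasets per event."""
    return Counter(
        e["user"]
        for e in events
        if sum(e["per_dataset_bytes"].values()) > TB
    )


print(count_with_job_total(events))
print(count_without_job_total(events))
```

   in Python the extra step is just one `sum(...)`, but it has to be repeated 
in every analysis; in a query language it typically means expanding the nested 
per-dataset values and re-grouping by event before the filter can run.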

