phet commented on PR #3667: URL: https://github.com/apache/gobblin/pull/3667#issuecomment-1505582228
> Oh, I meant that I'll aggregate on the task level but let the client handle aggregation on the dataset level, since it is rare to have a single pipeline manage say tens of thousands of datasets.

Yes, I agree, but think of it less in terms of efficiency and the number of ops needed to perform the aggregation than of convenience: even if there are only 5, it still requires a piece of code to total them per event, within every downstream analysis that considers the job-level total. I believe this to be most analyses... so although I generally recommend against denormalized data structures, I see the ready-to-use convenience outweighing the design principle.

Consider the simple problem of counting, by user, who's copying > 1 TB in a job. It's trivially simple to filter by bytes... but only so long as there is a job-level sum. If the event has only task-level aggregation, the friction increases considerably, whether in SQL or Kusto.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
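
As a rough illustration of the convenience argument in the comment above, here is a minimal Python sketch of the "count by user who copied > 1 TB in a job" query against two event shapes. The event layout and every name in it (`bytes_by_dataset`, `total_bytes`, the sample users and jobs) are hypothetical, chosen only for illustration; they are not taken from the PR or from Gobblin's actual event schema.

```python
from collections import Counter

TB = 10**12  # 1 TB in bytes (decimal), used as the filter threshold

# Hypothetical events that carry only per-dataset byte counts.
events = [
    {"user": "alice", "job": "job-1",
     "bytes_by_dataset": {"ds_a": 7 * 10**11, "ds_b": 6 * 10**11}},
    {"user": "bob", "job": "job-2",
     "bytes_by_dataset": {"ds_c": 4 * 10**11}},
    {"user": "alice", "job": "job-3",
     "bytes_by_dataset": {"ds_d": 2 * 10**12}},
]


def count_large_jobs_without_total(events, threshold=TB):
    """Every analysis must first total the per-dataset counts per event."""
    counts = Counter()
    for event in events:
        job_total = sum(event["bytes_by_dataset"].values())
        if job_total > threshold:
            counts[event["user"]] += 1
    return counts


def with_job_total(event):
    """Denormalize once at emit time: add a (hypothetical) job-level sum."""
    return {**event, "total_bytes": sum(event["bytes_by_dataset"].values())}


def count_large_jobs_with_total(events, threshold=TB):
    """With a job-level sum on the event, the query is a plain filter."""
    return Counter(e["user"] for e in events if e["total_bytes"] > threshold)


denormalized = [with_job_total(e) for e in events]
large_by_user = count_large_jobs_with_total(denormalized)
```

In Python the extra summing step is small; in SQL or Kusto the same step would typically mean expanding the nested per-dataset values and re-aggregating per event before the filter can be applied, which is the friction the comment describes.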
