mosche commented on PR #22182:
URL: https://github.com/apache/beam/pull/22182#issuecomment-1180436296

   Thanks a million for your thorough review and profiling the code, @lukecwik 
💯 
   
   > I'm worried that the timed portion is being dominated by the benchmark 
code and not the code that we want to measure.
   
   I do agree that this is a huge concern and it's been tricky in this case. 
I've moved as much of the overhead into `setup` to prepare an array of rows for 
each benchmark iteration. I've optimised `setup` code according to your 
suggestion (using `for` loops), but the overhead of iterating over that array 
will remain (the same) during benchmark.
   
   In case of the simple int field that was really significant, thanks for 
catching that. It's interesting to see how costly nested structures are (e.g. 
processNestedIntField) in comparison, in that case the effect was much less.
   
   Switching to `READ_REPEATED` (as you suggested) reduced the proportional 
overhead and removing the nested loop in 
[readField](https://github.com/apache/beam/pull/22182/commits/95c425111d0c0fb5948fe46bbb9ad071b138f355#diff-d1eec5d84b10a3d22d612c37df41ec483b3058034c8436932e77ac808690ed76L134-R149)
 (replaced by `readRowsOnce`, `readRowsRepeatedly`) also helped a bit. But in 
the end the code doesn't do much more than throwing the row into a sink. 
   
   For collection types, I've reduced the size to one in the factory to avoid 
having to iterate over the collections when reading. Nevertheless, given there 
might be collections of rows (with getters itself), I think it's important to 
access these rather than throwing the entire collection into the blackhole.
   
   If you see any possible further optimisations, I'd be glad to hear. 
   To make things easier to follow when profiling, I've replaced the lambdas 
with classes.
   
   Read `int` field once:
   <img width="829" alt="Screenshot 2022-07-11 at 14 19 06" 
src="https://user-images.githubusercontent.com/1401430/178271981-baab50c5-2b9a-4220-8e98-159d4af4837a.png";>
   Read `int` field repeatedly:
   <img width="828" alt="Screenshot 2022-07-11 at 14 19 20" 
src="https://user-images.githubusercontent.com/1401430/178271969-84d4ce9e-9d25-48ad-82d6-90e21dbacc88.png";>
   Read nested `int` field repeatedly:
   <img width="829" alt="Screenshot 2022-07-11 at 14 19 32" 
src="https://user-images.githubusercontent.com/1401430/178271957-e47b054e-970a-4cfa-a41a-b106c37b7de8.png";>
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to