damccorm commented on code in PR #37532:
URL: https://github.com/apache/beam/pull/37532#discussion_r3010122264


##########
sdks/python/apache_beam/transforms/util.py:
##########
@@ -1319,6 +1320,274 @@ def expand(self, pcoll):
               self._batch_size_estimator, self._element_size_fn))
 
 
+def _default_element_size_fn(element: Any) -> int:
+  """Default element size function that tries len(), falls back to 1.
+
+  This function attempts to compute the size of an element using len().
+  If the element does not support len() (e.g., integers), it falls back to 1.
+
+  Args:
+    element: The element to compute the size of.
+
+  Returns:
+    The size of the element, or 1 if len() is not supported.
+  """
+  try:
+    return len(element)
+  except TypeError:
+    return 1

Review Comment:
   We should probably warn in this case since this is almost certainly not what 
the user desired. To avoid warning on every element, we can make this function 
a member of the `_SortAndBatchElementsDoFn` and use a member variable to track 
if we've already warned. That way we only warn once per DoFn instance



##########
sdks/python/apache_beam/testing/benchmarks/sort_and_batch_benchmark.py:
##########
@@ -0,0 +1,650 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Benchmark: BatchElements vs SortAndBatchElements (weight-based splitting).
+
+Compares two batching strategies for variable-length inference workloads:
+
+- Baseline (BatchElements): fixed-count chunking, ignores element sizes.
+- Stateless (SortAndBatchElements): within each bundle, sorts elements
+  by size, then splits batches using max_batch_weight so that each batch
+  has a bounded total weight.  The improvement comes from *changing batch
+  boundaries* (weight-based splitting), NOT from sorting alone -- sorting
+  within fixed boundaries yields 0% gain (verified by strict-control).

Review Comment:
   > The improvement comes from *changing batch
   >  boundaries* (weight-based splitting), NOT from sorting alone
   
   This is surprising to me - doesn't sorting the batch functionally change the 
batch boundaries as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to