HyukjinKwon commented on a change in pull request #28135:
[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scalar iterator
pandas UDF
URL: https://github.com/apache/spark/pull/28135#discussion_r404473227
##########
File path: python/pyspark/worker.py
##########
@@ -357,8 +357,14 @@ def map_batch(batch):
num_output_rows = 0
for result_batch, result_type in result_iter:
num_output_rows += len(result_batch)
- assert is_map_iter or num_output_rows <= num_input_rows[0], \
- "Pandas MAP_ITER UDF outputted more rows than input rows."
+
+ if is_scalar_iter and num_output_rows != num_input_rows[0]:
Review comment:
@WeichenXu123, I think we should either keep this condition or remove it.
As written, it requires each series in the output iterator to be the same
size as the corresponding input series, so it can fail fast.
If we remove it, only the iterators as a whole have to be the same size,
and the whole iterator has to be computed before the size can be checked.
I remember you used this on the ML side, right? Which one satisfies your
case and general ML cases?
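To make the two options concrete, here is a minimal sketch in plain Python. It is not the code in worker.py: lists stand in for pandas Series, and `verify_fail_fast`, `verify_at_end`, and `uneven` are hypothetical names used only for illustration.

```python
def _counting(batches, counter):
    # Track how many input rows the UDF has consumed so far.
    for batch in batches:
        counter[0] += len(batch)
        yield batch


def verify_fail_fast(input_batches, udf):
    # Option 1 (keep the per-batch condition): after every output batch,
    # the running output count must equal the running input count.
    num_input_rows = [0]
    num_output_rows = 0
    for result in udf(_counting(input_batches, num_input_rows)):
        num_output_rows += len(result)
        if num_output_rows != num_input_rows[0]:
            raise ValueError("output batch size does not match input batch size")
        yield result


def verify_at_end(input_batches, udf):
    # Option 2 (remove the per-batch condition): only the totals are
    # compared, which is possible only once the iterator is exhausted.
    num_input_rows = [0]
    num_output_rows = 0
    for result in udf(_counting(input_batches, num_input_rows)):
        num_output_rows += len(result)
        yield result
    if num_output_rows != num_input_rows[0]:
        raise ValueError("total output rows do not match total input rows")


def uneven(batches):
    # Hypothetical UDF: same total number of rows, different batch boundaries.
    it = iter(batches)
    first = next(it)
    yield first[:1]
    rest = first[1:]
    for batch in it:
        rest = rest + batch
    yield rest
```

With input `[[1, 2], [3, 4]]`, `uneven` is rejected by `verify_fail_fast` on its first output batch, while `verify_at_end` accepts it because the totals match. A UDF that re-batches its output like this is only allowed under the second option.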