advancedxy commented on code in PR #41192:
URL: https://github.com/apache/spark/pull/41192#discussion_r1197287073
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:
##########
@@ -26,14 +26,14 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
private[protobuf] case class CatalystDataToProtobuf(
child: Expression,
messageName: String,
- descFilePath: Option[String] = None,
+ fileDescriptorSetBytesOpt: Option[Array[Byte]] = None,
Review Comment:
> most of the time size would be in kilobytes
Yes. Most of the time, we don't need to bother the size of protobuf
messages. But there are certain cases, huge message and huge protobuf repo is
used, such as ML featuring engineering: they could define thousands, or even
tens of thousands fields in one message.
> Can we do that as an implementation detail (so that users don't need be
aware of it).
Of course, we should do that as an implementation details, users should be
no aware how the bytes are distributed. And we can be adaptive here, only the
size exceeds a threshold, say 512K, we should broadcast instead of serializing.
I just checked the related spark connect impl, seems functions are just
placeholders on the client side, it will work fine with spark connect.
> I will add a comment. Do you think we can do that an enhancement later?
A comment is sufficient for now. Since we can improve it via broadcast, this
enhancement could be addressed in later prs.
> We also need to make sure it is cleaned up.
P.S. if we are using it correctly, it should be cleaned up automatically. A
lot of places uses broadcast.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]