milenkovicm commented on code in PR #20037:
URL: https://github.com/apache/datafusion/pull/20037#discussion_r2733812185


##########
datafusion/proto/src/physical_plan/mod.rs:
##########
@@ -3661,7 +3664,44 @@ struct DataEncoderTuple {
     pub blob: Vec<u8>,
 }
 
-pub struct DefaultPhysicalProtoConverter;
+/// Default implementation of [`PhysicalProtoConverterExtension`] that provides
+/// expression deduplication during deserialization.
+///
+/// During serialization, the Arc pointer address of each expression is 
embedded
+/// in the protobuf as `expr_arc_id`. During deserialization, if an expression
+/// with the same `expr_arc_id` has been seen before, the cached Arc is 
returned
+/// instead of creating a new one. This enables expression sharing and can
+/// significantly reduce memory usage for plans with duplicate expressions
+/// (e.g., large IN lists).
+///
+/// # Important: Scope of Deduplication
+///
+/// The `expr_arc_id` is only valid as a deduplication key **within a single
+/// serialized plan from a single process**. Arc pointer addresses can collide:
+/// - Different processes may allocate Arcs at the same address
+/// - The same process may reuse addresses after deallocation
+///
+/// Therefore, you **must create a fresh `DefaultPhysicalProtoConverter` 
instance
+/// for each plan you deserialize**. Do not reuse the same converter instance
+/// across multiple plans from different sources, as this could incorrectly
+/// deduplicate unrelated expressions that happen to share the same pointer 
address.
+#[derive(Default)]
+pub struct DefaultPhysicalProtoConverter {
+    /// Cache for expression deduplication during deserialization.
+    /// Maps expr_arc_id (the original Arc pointer address) to the 
deserialized expression.
+    ///
+    /// This cache should only be used for a single plan deserialization.
+    /// Create a new converter instance for each plan to avoid cross-plan 
collisions.
+    dedup_cache: RwLock<HashMap<u64, Arc<dyn PhysicalExpr>>>,

Review Comment:
   I still believe this should not be pushed as default  for all users. If it 
is needed for distribution, it should be implemented by specific distribution 
engine 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to