pvary commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-3431151392

   Thanks for the explanation, @huaxingao! I see several possible workarounds 
for the DataWriterFactory serialization issue, but I have more fundamental 
concerns about the overall approach.
   I believe shredding should be driven by future reader requirements rather 
than by the actual data being written. Ideally, it should remain relatively 
stable across data files within the same table and originate from a writer job 
configuration—or even better, from a table-level configuration.
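
   To make this concrete, here is a minimal sketch of what I mean by a 
table-level configuration. The property name and value format are made up for 
illustration; nothing like this exists today:

   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.Catalog;
   import org.apache.iceberg.catalog.TableIdentifier;

   // Hypothetical sketch: pin the shredded variant fields in table metadata so
   // every writer job derives the same shredding schema, independent of the
   // data it happens to see. The property name and value format are invented
   // here; no such property exists today.
   class ShreddingConfigSketch {
     static void pinShreddingSchema(Catalog catalog) {
       Table table = catalog.loadTable(TableIdentifier.of("db", "sensors"));
       table.updateProperties()
           .set("write.variant.shredding.schema", // hypothetical property key
               "measurement.temperature:double,measurement.unit:string")
           .commit();
     }
   }
   ```

   Writers would then build the Parquet shredding schema from this property 
instead of inferring it from the incoming records.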
   
   Even if we accept that the written data should dictate the shredding logic, 
Spark’s implementation—while dependent on input order—is at least somewhat 
stable. It drops rarely used fields, handles inconsistent types, and limits the 
number of columns.
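
   Very roughly, that kind of inference could look like the sketch below; this 
is my own simplification for discussion, not Spark's actual code:

   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   // Rough sketch of frequency-based shredding-schema inference; my own
   // simplification, NOT Spark's actual code. Sample the input, keep only
   // fields that appear often enough, demote fields with conflicting types,
   // and cap the number of shredded columns.
   class ShreddingInferenceSketch {
     static final double MIN_FIELD_FREQUENCY = 0.1; // drop rarely used fields
     static final int MAX_SHREDDED_COLUMNS = 50;    // limit the column count

     static Map<String, String> inferSchema(List<Map<String, Object>> sampledRows) {
       Map<String, Integer> counts = new LinkedHashMap<>();
       Map<String, String> types = new LinkedHashMap<>();

       for (Map<String, Object> row : sampledRows) {
         for (Map.Entry<String, Object> field : row.entrySet()) {
           counts.merge(field.getKey(), 1, Integer::sum);
           Object value = field.getValue();
           String type = value == null ? "null" : value.getClass().getSimpleName();
           // Fields seen with conflicting types fall back to an unshredded blob.
           types.merge(field.getKey(), type, (a, b) -> a.equals(b) ? a : "binary");
         }
       }

       Map<String, String> schema = new LinkedHashMap<>();
       for (Map.Entry<String, Integer> field : counts.entrySet()) {
         if (schema.size() >= MAX_SHREDDED_COLUMNS) {
           break; // cap the width of the shredded layout
         }
         double frequency = (double) field.getValue() / sampledRows.size();
         if (frequency >= MIN_FIELD_FREQUENCY) {
           schema.put(field.getKey(), types.get(field.getKey()));
         }
       }
       return schema;
     }
   }
   ```
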
   I understand this is only a PoC implementation for shredding, but I’m 
concerned that the current simplifications make it very unstable. If I’m 
interpreting correctly, the logic infers the type from the first occurrence of 
each field and creates a column for every field. This could lead to highly 
inconsistent column layouts within a table, especially in IoT scenarios where 
multiple sensors produce vastly different data.
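
   To make the concern concrete, a toy illustration of the first-occurrence 
approach (my own example, not code from this PR):

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Toy illustration (mine, not code from this PR): first-occurrence type
   // inference with a column per field gives every writer task its own layout.
   class FirstOccurrenceExample {
     // Derive the shredding "schema" from the first record only.
     static Map<String, String> inferFromFirst(Map<String, Object> firstRecord) {
       Map<String, String> schema = new LinkedHashMap<>();
       firstRecord.forEach((name, value) -> schema.put(name, value.getClass().getSimpleName()));
       return schema;
     }

     public static void main(String[] args) {
       // Two tasks writing to the same table, each seeing a different sensor first.
       Map<String, Object> thermometer = new LinkedHashMap<>();
       thermometer.put("temp", 21L);
       thermometer.put("unit", "C");

       Map<String, Object> gps = new LinkedHashMap<>();
       gps.put("lat", 47.5);
       gps.put("lon", 19.0);
       gps.put("temp", "21.4");

       System.out.println(inferFromFirst(thermometer)); // {temp=Long, unit=String}
       System.out.println(inferFromFirst(gps));         // {lat=Double, lon=Double, temp=String}
       // The resulting data files disagree both on the set of shredded columns
       // and on the type of the shared "temp" field.
     }
   }
   ```
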
   Did I miss anything?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

