cashmand commented on PR #48779: URL: https://github.com/apache/spark/pull/48779#issuecomment-2465413855
> So if I understand correctly, the shredding write chain may be like this: get the expected shredded schema (DataType) in some way (sampling, or just user-defined?) -> the Parquet writer accepts the variant + shredded schema -> cast to a shredded InternalRow -> write to the Parquet file (the actual column type is the group type corresponding to the shredded DataType).

Yes, exactly. For the initial implementation and testing, I plan to set the schema explicitly. I think sampling makes sense as a better user experience in the long term, but it needs some thought about the best way to implement it.

> From this perspective, consider the integration with lake formats: a lake format generally has its own reader or writer. I feel that the lake format may still receive the raw variant data, and the same shredding logic would have to be implemented in the format's writer. If you have any other ideas on this, I'd love to hear your perspective, thanks!

What do you mean by "lake format"? Are you referring to formats like Iceberg or Delta? I made an effort in this PR to keep the shredding logic in common/variant, and created an interface (ShreddedResult and ShreddedResultBuilder) that Spark implements to construct the InternalRow that it uses in its Parquet writer. The intent is that other writers could implement the interface to match their data types, and still reuse the same code for the shredding logic. Eventually, we can separate common/variant into its own Java library (and maybe move it to the Parquet project, which is where the Variant spec is moving to), to make it easier for other JVM-based writers to use the Variant implementation outside of Spark.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
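To make the builder idea above concrete, here is a minimal Java sketch of the pattern: a writer-agnostic builder interface that the shared shredding logic calls, with an engine-specific implementation that maps the results onto its own row representation. All names and signatures here (`ShreddedResultBuilder`, `addTypedValue`, `addResidual`, `ListBackedBuilder`) are illustrative assumptions, not the actual interfaces in common/variant.

```java
import java.util.ArrayList;
import java.util.List;

public class ShreddingSketch {
    // Hypothetical writer-agnostic interface: the shared shredding logic
    // reports each variant field either as a typed value (it matched the
    // shredded schema) or as residual raw variant bytes (it did not).
    interface ShreddedResultBuilder {
        void addTypedValue(String fieldName, Object value);
        void addResidual(String fieldName, byte[] rawVariant);
    }

    // Toy engine-specific implementation: collects fields into lists.
    // A real engine (e.g. Spark) would populate its own row type instead.
    static class ListBackedBuilder implements ShreddedResultBuilder {
        final List<String> typedFields = new ArrayList<>();
        final List<String> residualFields = new ArrayList<>();
        public void addTypedValue(String fieldName, Object value) {
            typedFields.add(fieldName + "=" + value);
        }
        public void addResidual(String fieldName, byte[] rawVariant) {
            residualFields.add(fieldName);
        }
    }

    // Stand-in for the shared shredding logic: fields that cast cleanly to
    // the target schema become typed columns; the rest stay as raw variant.
    static void shred(ListBackedBuilder builder) {
        builder.addTypedValue("id", 42L);               // matches shredded schema (long)
        builder.addResidual("extra", new byte[]{0x01}); // no matching shredded column
    }

    public static void main(String[] args) {
        ListBackedBuilder b = new ListBackedBuilder();
        shred(b);
        System.out.println(b.typedFields);    // typed columns written to Parquet
        System.out.println(b.residualFields); // fields kept as raw variant bytes
    }
}
```

The point of the indirection is that the casting/shredding decisions live in one shared library, while each writer only supplies the builder that targets its own data types.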
