cashmand commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2358485271
@Zouxxyy, thanks, these are great questions, which we don't have clear answers for yet, but I'll give you my high-level thoughts.

> 1. Will shredded variant be a new type? Because I see that it is currently a nested and changing Struct type, it is a bit difficult to imagine how to describe it.

The intent is for it to not be an entirely new type. For the purposes of describing the desired read/write schema to a data source, I think we might want to do something like extend the current VariantType to specify a shredding schema, but I don't think most of Spark should need to consider it to be a distinct type.

> 2. For the write side, how is the shredding schema generated adaptively? From the description in the document, it looks dynamic, is it at the table level / file level / or even rowGroup level? And, I see that many layers of nesting are currently designed, does this have an impact on the write overhead.

The intent is to allow it to vary at the file level. (I don't think row-group level is an option for Parquet, since Parquet metadata has a single schema per file.) The exact mechanism is still up in the air. We could start as simply as having a single user-specified schema per write job, but ultimately I think we'd like to either have Spark determine a shredding schema dynamically, or provide enough flexibility in the data source API to let connectors determine a shredding schema themselves.

> 3. For the read side, if it is a file-level schema, how should spark integrate it when reading. For example, if we want to obtain a certain path, but if the schemas of different files are different, how should we determine the physical plan.

Also a tough question. I think we'll either need the Parquet reader to handle the per-file manipulation, or provide an API that allows data sources to inject per-file expressions to produce the data needed by the query. (This could be useful in other scenarios, like type widening, which might have data-source-specific requirements.)
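To make the write-side idea concrete, here is a rough sketch of what "determine a shredding schema dynamically" could mean: sample the variant values destined for a file, count how often each top-level field appears with a stable type, and shred only the fields that clear a frequency threshold. This is purely illustrative (real variant values are binary, not dicts, and the function name and threshold are invented), not anything the PR implements:

```python
from collections import Counter

def infer_shredded_fields(samples, threshold=0.5):
    """Hypothetical per-file shredding-schema inference.

    samples: list of dicts mapping top-level field name -> simple type tag
             (e.g. "long", "string"), standing in for decoded variant values.
    Returns the fields (with their type) that appear with a consistent type
    in at least `threshold` fraction of the sampled rows.
    """
    n = max(len(samples), 1)
    # Count (field, type) pairs so that a field only qualifies when a
    # single type for it is common enough; mixed-type fields stay unshredded.
    counts = Counter(
        (field, type_tag)
        for sample in samples
        for field, type_tag in sample.items()
    )
    return {
        field: type_tag
        for (field, type_tag), c in counts.items()
        if c / n >= threshold
    }
```

For example, over rows `[{"a": "long", "b": "string"}, {"a": "long"}, {"a": "string"}]` with a 0.6 threshold, only `a` as a long qualifies (2 of 3 rows), so only that path would get a typed column while everything else stays in the unshredded variant value.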
We're still looking into what the best approach is here.
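As a rough sketch of the read-side idea (files with different shredding schemas serving the same requested path), each file's scan could decide per file whether a path is served from a typed column or falls back to extraction from the residual variant bytes. All names here (`plan_path_read`, the `typed_value`/`value` layout strings) are hypothetical placeholders, not an actual Spark API:

```python
def plan_path_read(path, file_shredded_fields):
    """Hypothetical per-file planning for one requested variant path.

    file_shredded_fields: the shredding schema of this particular file,
    as a dict of shredded top-level field -> type tag.
    """
    if path in file_shredded_fields:
        # This file shredded the path: read the typed Parquet column directly.
        return ("typed_column", f"typed_value.{path}.typed_value")
    # Otherwise fall back to extracting the path from the unshredded
    # variant value, variant_get-style.
    return ("variant_extract", f"value:$.{path}")
```

The point of the sketch is that the same logical query path yields a different physical expression per file, which is why either the Parquet reader or a per-file expression-injection API would need to own this step.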
