cashmand commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2358485271

   @Zouxxyy, thanks, these are great questions. We don't have clear answers for all of them yet, but I'll give you my high-level thoughts.
   
   > 1. Will the shredded variant be a new type? I see that it is currently a nested struct whose shape varies, so it is a bit difficult to imagine how to describe it.
   
   The intent is for it not to be an entirely new type. For the purpose of describing the desired read/write schema to a data source, we might want to do something like extend the current VariantType to carry a shredding schema, but I don't think most of Spark should need to treat it as a distinct type.
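   To make that concrete, here's a rough sketch of one way it could look. Everything here is illustrative; none of these names are a committed Spark API.
   
   ```scala
   import org.apache.spark.sql.types._
   
   // Hypothetical sketch: keep VariantType as the logical type everywhere in
   // Spark, and attach the shredding schema as optional column metadata that
   // only data sources inspect when negotiating a read/write schema.
   case class ShreddedVariantHint(shreddingSchema: StructType)
   
   object ShreddedVariantHint {
     // A variant column with a hint still reports plain VariantType to the
     // analyzer; the shredding schema rides along in the field metadata.
     def attach(field: StructField, hint: ShreddedVariantHint): StructField = {
       require(field.dataType == VariantType, "hint only applies to variant columns")
       val md = new MetadataBuilder()
         .withMetadata(field.metadata)
         .putString("shreddingSchema", hint.shreddingSchema.json)
         .build()
       field.copy(metadata = md)
     }
   }
   ```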
   
   > 2. On the write side, how is the shredding schema generated adaptively? From the description in the document it looks dynamic; is it at the table level, the file level, or even the row-group level? Also, the current design has many layers of nesting; does that add write overhead?
   
   The intent is to allow it to vary at the file level. (I don't think row-group level is an option for Parquet, since Parquet metadata has a single schema per file.) The exact mechanism is still up in the air. We could start as simply as a single user-specified schema per write job, but ultimately I think we'd like either to have Spark determine a shredding schema dynamically, or to give the data source API enough flexibility for connectors to determine one.
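   To illustrate file-level variation, two files of the same table could end up with different shredded layouts for the same variant column. The StructTypes below loosely follow the value/typed_value layout in the shredding doc; the exact field names and rules are what the spec defines, so treat this as a sketch.
   
   ```scala
   import org.apache.spark.sql.types._
   
   // File A: the writer observed field `ts` as mostly long values,
   // so it shredded `ts` out as a typed int64 column.
   val fileA = StructType(Seq(
     StructField("metadata", BinaryType, nullable = false),
     StructField("value", BinaryType), // fallback for rows that don't fit the shape
     StructField("typed_value", StructType(Seq(
       StructField("ts", StructType(Seq(
         StructField("value", BinaryType), // rows where `ts` wasn't a long
         StructField("typed_value", LongType)))))))))
   
   // File B: a later writer instead shredded `tag` as a string. A reader has
   // to reconcile both layouts behind the same logical variant column.
   val fileB = StructType(Seq(
     StructField("metadata", BinaryType, nullable = false),
     StructField("value", BinaryType),
     StructField("typed_value", StructType(Seq(
       StructField("tag", StructType(Seq(
         StructField("value", BinaryType),
         StructField("typed_value", StringType)))))))))
   ```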
   
   > 3. On the read side, if the shredding schema is file-level, how should Spark integrate it when reading? For example, if we want to extract a certain path but different files have different schemas, how should we determine the physical plan?
   
   Also a tough question. I think we'll either need the Parquet reader to handle the per-file manipulation, or provide an API that lets data sources inject per-file expressions to produce the data the query needs. (The latter could be useful in other scenarios, like type widening, which might have data-source-specific requirements.) We're still looking into the best approach here.
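   As a sketch of the second option, a connector hook might look something like the trait below. The name and signature are made up for illustration; nothing like this exists yet.
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.types.StructType
   
   // Hypothetical hook: the data source translates "give me this variant path"
   // into an expression over whatever physical (shredded) schema one particular
   // file happens to have, so the logical plan stays file-agnostic.
   trait PerFileVariantRewrite {
     // `filePhysicalSchema`: the shredded schema actually present in one file.
     // `requestedPath`: the variant path the query needs, e.g. Seq("a", "b").
     // Returns an expression that evaluates against that file's rows.
     def expressionFor(filePhysicalSchema: StructType, requestedPath: Seq[String]): Expression
   }
   ```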
   

