Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/5141#issuecomment-86443584
  
    Hi @liancheng, thanks for reviewing.
    
    One small query on a separate note: in the current implementation of
mergeMetastoreParquetSchema, the merged Parquet schema is computed from part
files across all the partitions. Does this scale? With millions of
partitions, wouldn't this slow down every read query, even when only a small
number of partitions is actually referenced?
    I was wondering whether we could instead derive the unified schema from
just the referenced partitions. (Though in that case I think we would need a
summary file containing all the columns in the base path of the table.)
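    To make the suggestion concrete, here is a hypothetical, language-agnostic sketch (not Spark's actual mergeMetastoreParquetSchema; partition names, the type representation, and the merge_schemas/schema_for_query helpers are all invented for illustration). The idea is to union only the schemas of the partitions a query touches, rather than reading part-file footers from every partition:

```python
def merge_schemas(schemas):
    """Union column->type maps; a conflicting type for the same column is an error."""
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            if col in merged and merged[col] != typ:
                raise ValueError(f"conflicting types for {col}: {merged[col]} vs {typ}")
            merged[col] = typ
    return merged

# Per-partition schemas, as would be discovered from part-file footers
# (illustrative data; a real table could have millions of entries).
all_partitions = {
    "date=2015-01-01": {"id": "int", "name": "string"},
    "date=2015-01-02": {"id": "int", "name": "string", "email": "string"},
}

def schema_for_query(referenced_partitions):
    # Only the footers of partitions the query filters down to are consulted.
    return merge_schemas(all_partitions[p] for p in referenced_partitions)
```

    For example, a query that only touches date=2015-01-02 would merge one schema instead of all of them; the open question above is how later-added columns (like email) would be visible without a summary file when their partitions are not referenced.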

