Github user saucam commented on the pull request:
https://github.com/apache/spark/pull/5141#issuecomment-86443584
hi @liancheng, thanks for reviewing.
One small query on a separate note: in the current implementation of
mergeMetastoreParquetSchema, I see that part files from all the partitions
are used to compute the merged Parquet schema. Does this scale? If we have
millions of partitions, won't this slow down every read query, even when
only a small number of partitions are actually referred to?
I was wondering whether we could change this to derive the unified schema
from just the referred partitions. (Though in that case, I think we would
need a summary file containing all the columns in the base path of the
table.)
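To make the suggestion concrete, here is a toy sketch (plain Python, not Spark's actual implementation; `merge_schemas`, `schema_for_query`, and the partition map are all hypothetical names) of deriving a unified schema from only the partitions a query touches, rather than from every partition:

```python
# Toy model: each partition's Parquet schema is a {field name -> type} map.
# Merging only the referred partitions avoids touching footers of the
# (possibly millions of) partitions the query never reads.

def merge_schemas(schemas):
    """Union the fields of several schemas; a simplified stand-in for
    Parquet schema merging that rejects conflicting types."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(f"conflicting types for field {name!r}")
            merged[name] = dtype
    return merged

# Hypothetical per-partition schemas, keyed by partition directory.
partition_schemas = {
    "dt=2015-03-01": {"id": "int", "name": "string"},
    "dt=2015-03-02": {"id": "int", "score": "double"},
    # ...imagine millions more partitions here...
}

def schema_for_query(referred_partitions):
    # Read schemas only for the partitions the query refers to.
    return merge_schemas(partition_schemas[p] for p in referred_partitions)
```

With this shape, a query over a single partition pays for one schema lookup instead of a scan over every partition, which is the scaling concern raised above; the trade-off is that fields appearing only in untouched partitions are invisible unless a table-level summary file supplies them.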