Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/13137
@maropu Just had an offline discussion with @yhuai. This case is a
little bit different from #13444. In #13444, the number of leaf files is
unknown before issuing the job, and each task may take one or more directories
and list them recursively, so increasing parallelism is potentially
useful. Moreover, listing leaf files may suffer from data skew (one directory
containing significantly more files than the others).
In the Parquet schema reading case, the number of files is already known, and
there's no data skew problem.
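The distinction above can be sketched roughly as follows (in Python, purely illustrative — this is not Spark's actual implementation). Each "task" takes a top-level directory and lists it recursively; because the number of leaf files per directory is unknown up front and may be skewed, spreading directories across more workers can help, whereas a workload with a known, even file count gains nothing from the extra fan-out:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_leaf_files(d):
    """Recursively list leaf files under one directory (one 'task')."""
    leaves = []
    for entry in os.scandir(d):
        if entry.is_dir():
            leaves.extend(list_leaf_files(entry.path))
        else:
            leaves.append(entry.path)
    return leaves

def parallel_list(root, workers=4):
    """Fan top-level directories out to workers; a skewed directory
    (far more files than its siblings) no longer blocks the rest."""
    dirs = [e.path for e in os.scandir(root) if e.is_dir()]
    files = [e.path for e in os.scandir(root) if e.is_file()]
    with ThreadPoolExecutor(workers) as pool:
        for sub in pool.map(list_leaf_files, dirs):
            files.extend(sub)
    return files
```

Both functions return the same set of leaf files; only the degree of parallelism differs, which is exactly why parallelism matters for the listing case but not for schema reading, where the file set is already known.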