I forgot to mention an important detail: I'm issuing the same query against both parquet files, selecting only one column:
df.select(sum('amount))

BR,
Tomas

On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos <tomas.barta...@gmail.com> wrote:

> Hello,
>
> I have 2 parquet datasets (each consisting of a single file):
>
> - parquet-wide - schema has 25 top-level cols + 1 array
> - parquet-narrow - schema has 3 top-level cols
>
> Both files contain the same data in the shared columns.
> When I read from parquet-wide, Spark reports *52.6 KB read*; from
> parquet-narrow, *only 2.6 KB*.
> For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to
> say, reading the narrow parquet is much faster.
>
> Since schema pruning is applied, I *expected similar results* in
> both scenarios (timing and amount of data read).
> What do you think is the reason for such a big difference? Is there any
> tuning I can do?
>
> Thank you,
> Tomas
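[Editor's note: a minimal sketch, not from the original thread, of how to confirm which columns Spark actually reads in each case. The SparkSession setup and file paths are placeholders. The physical plan's parquet FileScan node prints a ReadSchema attribute; if column pruning is effective, both plans should list only the amount column.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("pruning-check").getOrCreate()
import spark.implicits._

// Placeholder paths; substitute the actual locations of the two files.
val wide   = spark.read.parquet("/path/to/parquet-wide")
val narrow = spark.read.parquet("/path/to/parquet-narrow")

// With pruning working, both FileScan nodes should report
// ReadSchema: struct<amount:...> regardless of how wide the file is.
wide.select(sum($"amount")).explain()
narrow.select(sum($"amount")).explain()

If both plans already prune down to the amount column, the remaining gap likely comes from per-file scan overhead (for example, the wide file's larger footer metadata and page headers) rather than from reading extra column data.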