Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/13137
@maropu Just had an offline discussion with @yhuai. This case is a
little bit different from #13444. In #13444, the number of leaf files is
unknown before issuing the job, and each task may take one or more directories
and list them recursively, so increasing parallelism is potentially
useful. Moreover, listing leaf files may suffer from data skew (one directory
containing significantly more files than the others).
In the Parquet schema reading case, the number of files is already known, and
there's no data skew problem.
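The distinction above can be sketched roughly as follows (in Python, purely illustrative — this is not Spark's actual implementation). Each "task" takes a top-level directory and lists it recursively; because the number of leaf files per directory is unknown up front and may be skewed, spreading directories across more workers can help, whereas a workload with a known, even file count gains nothing from the extra fan-out:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_leaf_files(d):
    """Recursively list leaf files under one directory (one 'task')."""
    leaves = []
    for entry in os.scandir(d):
        if entry.is_dir():
            leaves.extend(list_leaf_files(entry.path))
        else:
            leaves.append(entry.path)
    return leaves

def parallel_list(root, workers=4):
    """Fan top-level directories out to workers; a skewed directory
    (far more files than its siblings) no longer blocks the rest."""
    dirs = [e.path for e in os.scandir(root) if e.is_dir()]
    files = [e.path for e in os.scandir(root) if e.is_file()]
    with ThreadPoolExecutor(workers) as pool:
        for sub in pool.map(list_leaf_files, dirs):
            files.extend(sub)
    return files
```

Both functions return the same set of leaf files; only the degree of parallelism differs, which is exactly why parallelism matters for the listing case but not for schema reading, where the file set is already known.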