When using native file-based data sources (e.g. Parquet, ORC, JSON, ...),
small files are automatically packed together (and large files split) so
that each partition adds up to a target size, configurable via
spark.sql.files.maxPartitionBytes.
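
A minimal sketch of setting this, assuming a SparkSession named `spark`
and a placeholder input path (both hypothetical, for illustration only):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-packing-example") // hypothetical app name
      .getOrCreate()

    // Default is 128 MB (134217728 bytes); set explicitly here for clarity.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)

    // Placeholder path; small files get packed together up to the limit above.
    val df = spark.read.parquet("/path/to/parquet")
    println(df.rdd.getNumPartitions)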

spark.sql.files.openCostInBytes specifies the estimated cost, in bytes, of
opening each file. That is, an empty file will be counted as having at
least spark.sql.files.openCostInBytes bytes when files are packed into
partitions, so a single partition cannot accumulate an unbounded number of
tiny files.
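
A rough sketch of the effect, reusing `spark` from the snippet above, with
the default values (128 MB max partition size, 4 MB open cost) and a
hypothetical file count:

    // Defaults: maxPartitionBytes = 128 MB, openCostInBytes = 4 MB.
    // Each file is counted as at least openCostInBytes when packing,
    // so 1000 near-empty files weigh in at ~1000 * 4 MB = 4000 MB and
    // get spread over roughly 4000 / 128 ≈ 32 partitions rather than
    // all landing in one partition.
    spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)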

On Wed, Jul 6, 2016 at 11:53 PM, Ajay Srivastava <
a_k_srivast...@yahoo.com.invalid> wrote:

> Hi,
>
> This jira, https://issues.apache.org/jira/browse/SPARK-8813, is fixed in
> Spark 2.0, but the resolution is not mentioned there.
>
> In our use case, both big and many small Parquet files are being queried
> using Spark SQL.
> Can someone please explain what the fix is and how I can use it in Spark
> 2.0? I searched the commits in the 2.0 branch, and it looks like I need
> to use spark.sql.files.openCostInBytes, but I am not sure.
>
>
> Regards,
> Ajay
>
