[ 
https://issues.apache.org/jira/browse/DRILL-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933909#comment-15933909
 ] 

Paul Rogers commented on DRILL-5267:
------------------------------------

This issue cannot be easily tested from outside Drill. Here's what you'd 
have to do.

* Enable debug logging. This will slow down execution, but we're not worried 
about speed here.
* Run a query on a Parquet file.
* Look in the log for the entry that reports row size: find a line containing 
"actual col. size:" and check the surrounding messages.
* You will see a "density" metric in percent.

If the density is below 75%, then we are not making good use of allocated 
space. If density is low, then:
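The relationship between density and buffered memory can be sketched as follows 
(a minimal illustration of the arithmetic, not actual Drill code; the function 
name is made up for this example):

```python
# "Density" is the fraction of allocated vector memory that holds
# actual row data. At low density, the sort must buffer far more
# memory than the data itself requires.

def memory_needed(data_bytes: float, density: float) -> float:
    """Memory the sort must buffer for the given data at the given density."""
    return data_bytes / density

GB = 1 << 30
# At 50% density, 1 GB of actual data occupies 2 GB of batch memory.
print(memory_needed(1 * GB, 0.50) / GB)  # 2.0
```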

* The sort must buffer the entire incoming record batch, including wasted space.
* If the data size is, say, 1 GB (uncompressed) and the density is 50%, then 
the sort will need 2 GB of memory to buffer the data.
* If you give the sort only 1 GB of memory, you should see the sort spill 
multiple times (at 256 MB per spill). Maybe about four spills in this case.
* You can count the spills either using log messages or using the spill count 
metric in the query profile.
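One plausible reading of the numbers above, as a back-of-the-envelope estimate 
(assumptions: a fixed 256 MB per spill file and spilling only the overflow 
beyond the sort's memory budget; the function is illustrative, not Drill code):

```python
# Rough spill-count estimate: 2 GB of low-density batches arriving
# at a sort with only 1 GB of memory, spilled 256 MB at a time.

def estimated_spills(buffered_mb: int, memory_mb: int, spill_mb: int = 256) -> int:
    """Number of spill files needed to hold the memory overflow."""
    overflow = max(0, buffered_mb - memory_mb)
    return -(-overflow // spill_mb)  # ceiling division

print(estimated_spills(2048, 1024))  # 4, matching "about four spills"
```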

All that said, we know that the Parquet reader still creates low-density 
batches if any column is variable-length. The fix applies only to 
fixed-length columns.

> Managed external sort spills too often with Parquet data
> --------------------------------------------------------
>
>                 Key: DRILL-5267
>                 URL: https://issues.apache.org/jira/browse/DRILL-5267
>             Project: Apache Drill
>          Issue Type: Sub-task
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10.0
>
>
> DRILL-5266 describes how Parquet produces low-density record batches. The 
> result of these batches is that the external sort spills more frequently than 
> it should because it sizes spill files based on batch size, not data content 
> of the batch. Since Parquet batches are 95% empty space, the spill files end 
> up far too small.
> Adjust the spill calculations based on actual data content, not the size of 
> the overall record batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
