[ https://issues.apache.org/jira/browse/DRILL-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868230#comment-15868230 ]

Paul Rogers commented on DRILL-5267:
------------------------------------

The query for this case is:

{code}
select * from dfs.`/some_file.parquet` order by c_email_address
{code}

The file was created (by another person) as the output of a CTTAS over a join of TPC-H data.

When run locally using the unmanaged sort, we get the following results (total 
time, debug mode):

{code}
Results: 1,434,519 records, 4233 batches, 69,294 ms
{code}

The old sort used small spill batches:

{code}
read 339 records
{code}

With the managed sort, we get:

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
  Records: 1129, Total size: 32006144, Row width:28350, Density:5}

Memory Estimates: record size = 335 bytes; input batch = 32006144 bytes, 1129 records; 
merge batch size = 8388608 bytes, 25040 records; 
output batch size = 16777216 bytes, 50081 records; 
Available memory: 2147483648, spill point = 48783360, min. merge memory = 117440512

...
Starting spill from memory. Memory = 2079733760, Buffered batch count = 65, Spill batch count = 8
mergeAndSpill: completed, memory = 2090776608, spilled 9032 records to /tmp/.../spill3
{code}
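
As a side note, the per-column "density" numbers in the dump appear to be, roughly, the actual data size expressed as a percentage of the allocated vector memory. Here is a minimal sketch of that reading; the formula is an assumption, not necessarily how the batch sizer computes it:

{code}
// Assumed reading of the per-column "density" values above: data size as a
// percentage of allocated vector memory. Illustrative only.
static double densityPercent(long dataSize, long vectorSize) {
  return 100.0 * dataSize / vectorSize;
}
// c_email_address: 100 * 30327 / 49152  ~= 62   (reported density: 62)
// cs_sold_date_sk: 100 * 4516  / 131072 ~= 3.4  (reported density: 4)
{code}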

Let's look at the spill files:

{code}
3,841,038 spill1
3,836,810 spill2
3,834,634 spill3
3,846,039 spill4
{code}

This shows the impact of the low-density batches. Spill files are supposed to 
be 256 MB in size. But with 2 GB of memory at 5% density, we can hold only 
about 102 MB of actual data. We still need to figure out why the spill files 
are only about 4 MB each.
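
To make that concrete, here is the same arithmetic as a small sketch (the figures come from the log above; the names are mine, not Drill's):

{code}
// Back-of-the-envelope check of the expected spill sizes, using the logged figures.
static void spillSizeCheck() {
  long sortMemory    = 2_147_483_648L;      // 2 GB available to the sort
  double density     = 0.05;                // ~5% batch density from the log
  long spillFileGoal = 256L * 1024 * 1024;  // intended spill file size

  long actualData = (long) (sortMemory * density);  // ~102 MB of real payload
  System.out.printf("payload in memory = %,d bytes, spill file target = %,d bytes%n",
      actualData, spillFileGoal);
  // Even spilling everything in memory at once cannot reach the 256 MB target,
  // but that still does not explain spill files of only ~4 MB.
}
{code}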

Looking more carefully, the memory estimates say that a merge batch should be 
25,040 records, but the spill code says it spilled only 9,032 records.

Ah, the issue is the darn low-density batches again: we are trying to use the 
batch size to compute how much to actually spill:

{code}
    // Accumulate buffered batches until the estimated data size of this
    // spill would exceed the target spill file size.
    long estSize = 0;
    int spillCount = 0;
    for (InputBatch batch : bufferedBatches) {
      estSize += batch.getDataSize();
      if (estSize > spillFileSize) {
        break;
      }
      spillCount++;
    }
{code}

But, since batches are low-density, batch size is a very poor proxy for actual 
on-disk size.
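
To see how far off that proxy is, compare one batch's in-memory footprint with its payload (a sketch using the estimates logged above; again, the names are mine):

{code}
// Illustration of the memory-size vs. data-size gap for one input batch,
// using the logged estimates (32 MB of vectors, 1129 records of ~335 bytes).
static void proxyGap() {
  long batchMemory = 32_006_144L;   // total vector memory for one input batch
  long recordSize  = 335;           // estimated data bytes per record
  long recordCount = 1_129;         // records per input batch

  long payload = recordSize * recordCount;        // ~378 KB of real data
  double ratio = (double) batchMemory / payload;  // ~85x larger in memory
  System.out.printf("memory = %,d bytes, payload = %,d bytes (%.0fx)%n",
      batchMemory, payload, ratio);
  // Sizing spill files from the in-memory batch size therefore overshoots the
  // on-disk size by nearly two orders of magnitude.
}
{code}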

With the fix (sizing the spill by actual data content rather than by overall batch size):

{code}
Input Batch Estimates: record size = 335 bytes; input batch = 32006144 bytes, 1129 records
...
Starting spill from memory. .. Buffered batch count = 65, Spill batch count = 65
...
Results: 1,434,519 records, 31 batches, 37,586 ms

31,187,622 spill1
{code}

Spill files are roughly 10x larger and the run time is about half that of the 
unmanaged sort. All buffered batches are now spilled together. The number of 
batches used to deliver the output dropped from 4,233 to 31. So far so good.

Since each batch is 32 MB, memory holds 32 MB * 65 = 2,080 MB ≈ 2 GB of vectors. 
But at 5% density those vectors, combined, hold only 2 GB / 20 = 100 MB of data. 
Yet the spill file is only 31 MB, so something is still wrong.

To compute this a different way:

{code}
65 batches * 1129 records/batch * 335 bytes/record = 24 MB
{code}

This agrees (more or less) with the file size, which means that the density is 
actually:

{code}
24 MB / 2 GB = 1.2%
{code}
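
The per-row numbers from the batch dump tell the same story (a hedged recap; the variable names are mine):

{code}
// Per-row view of the same density figure, using the "Row width" from the
// batch sizer dump and the estimated record size.
static void perRowDensity() {
  long rowWidthInMemory = 28_350;   // vector memory per row ("Row width" above)
  long rowPayload       = 335;      // estimated data bytes per record

  double density = 100.0 * rowPayload / rowWidthInMemory;  // ~1.2%
  System.out.printf("per-row density = %.1f%%%n", density);
  // Each ~335-byte record is carried in roughly 28 KB of allocated vector memory.
}
{code}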

Something is seriously bizarre here: of the 2 GB used to hold batches, only 
24 MB is useful data. This is really bad news!

The sort is now doing the best it can with the batches it has been given. The 
next challenge is to understand why the batches hold so little data.

> Managed external sort spills too often with Parquet data
> --------------------------------------------------------
>
>                 Key: DRILL-5267
>                 URL: https://issues.apache.org/jira/browse/DRILL-5267
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10
>
>
> DRILL-5266 describes how Parquet produces low-density record batches. The 
> result is that the external sort spills more frequently than it should, 
> because it sizes spill files based on overall batch size rather than the 
> data content of the batch. Since Parquet batches are 95% empty space, the 
> spill files end up far too small.
> Adjust the spill calculations based on actual data content, not the size of 
> the overall record batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
