Paul Rogers created DRILL-5266:
----------------------------------

             Summary: Parquet Reader produces "low density" record batches
                 Key: DRILL-5266
                 URL: https://issues.apache.org/jira/browse/DRILL-5266
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.10
            Reporter: Paul Rogers


Testing with the managed sort revealed that, for at least one file, Parquet 
produces "low-density" batches: batches in which only 5% of each value vector 
contains actual data, with the rest being unused space. When such batches are 
fed into the sort, 95% of the buffered memory is wasted space and only 5% 
holds actual query data. The result is poor sort performance, as the sort 
must spill far more frequently than expected.

The managed sort analyzes incoming batches to prepare good memory use 
estimates. The following is the output for the Parquet file in question:

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 
196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 
196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 
196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, 
vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
  Records: 1129, Total size: 32006144, Row width:28350, Density:5}
{code}
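To illustrate the arithmetic behind the figures above (the exact rounding Drill's batch analyzer applies is an assumption here; `density_pct` is a hypothetical helper, not Drill code):

```python
# Sketch of the "density" figure reported above: the percentage of
# allocated vector memory that holds real data.

def density_pct(data_size: int, alloc_size: int) -> float:
    """Fraction of the allocated buffer occupied by actual data, in percent."""
    return 100.0 * data_size / alloc_size

# Figures from the log: cs_sold_date_sk holds 4,516 bytes of data in a
# 131,072-byte value vector (196,608 bytes total allocation).
per_vector = density_pct(4516, 131072)   # ~3.4% of the vector itself
per_alloc  = density_pct(4516, 196608)   # ~2.3% of the total allocation

# Batch-wide: 1,129 records in a 32,006,144-byte batch at ~5% density
# means roughly 95% of the buffered memory is wasted space.
wasted = 100.0 - 5.0

print(f"per-vector density: {per_vector:.1f}%")
print(f"per-allocation density: {per_alloc:.1f}%")
print(f"wasted space at 5% density: {wasted:.0f}%")
```

At this density, a sort given 2 GB of memory effectively buffers only about 100 MB of query data before it must spill.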



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)