[ 
https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Challapalli updated DRILL-5472:
-------------------------------------
    Attachment: drill5472.sys.drill
                drill5472.parquet
                drill5472.log

The value vector density information from the logs is below:
{code}
DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Actual batch schema & sizes {
  mapid(type: VARCHAR, std col. size: 54, actual col. size: 8, total size: 
73728, data size: 56004, row capacity: 8191, density: 76)
  col1(type: BIGINT, std col. size: 8, actual col. size: 9, total size: 73728, 
data size: 72000, row capacity: 8192, density: 98)
  col2(type: BIGINT, std col. size: 8, actual col. size: 29, total size: 
294912, data size: 224004, row capacity: 8191, density: 76)
  a(type: VARCHAR, std col. size: 54, actual col. size: 7, total size: 196608, 
data size: 126004, row capacity: 32767, density: 65)
  b(type: VARCHAR, std col. size: 54, actual col. size: 6, total size: 196608, 
data size: 112004, row capacity: 32767, density: 57)
  c(type: BIGINT, std col. size: 8, actual col. size: 9, total size: 294912, 
data size: 180000, row capacity: 32768, density: 62)
  d(type: FLOAT8, std col. size: 8, actual col. size: 9, total size: 294912, 
data size: 180000, row capacity: 32768, density: 62)
  T10¦¦missing1(type: INT, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing2(type: VARCHAR, std col. size: 54, actual col. size: 11, total 
size: 4521984, data size: 40004, row capacity: 65535, density: 1)
  T10¦¦missing3(type: BIT, std col. size: 1, actual col. size: 2, total size: 
73728, data size: 8000, row capacity: 65536, density: 11)
  T10¦¦missing4(type: FLOAT8, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing5(type: VARCHAR, std col. size: 54, actual col. size: 10, total 
size: 4521984, data size: 36004, row capacity: 65535, density: 1)
  T10¦¦missing6(type: DATE, std col. size: 8, actual col. size: 9, total size: 
589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing7(type: FLOAT8, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing8(type: VARCHAR, std col. size: 54, actual col. size: 9, total 
size: 4521984, data size: 32004, row capacity: 65535, density: 1)
  T10¦¦missing9(type: INTERVAL, std col. size: 16, actual col. size: 17, total 
size: 1114112, data size: 68000, row capacity: 65536, density: 7)
  T10¦¦missing10(type: INTERVAL, std col. size: 16, actual col. size: 17, total 
size: 1114112, data size: 68000, row capacity: 65536, density: 7)
  T10¦¦missing11(type: INTERVAL, std col. size: 16, actual col. size: 17, total 
size: 1114112, data size: 68000, row capacity: 65536, density: 7)
  x(type: BIGINT, std col. size: 8, actual col. size: 9, total size: 36864, 
data size: 36000, row capacity: 4096, density: 98)
  y(type: FLOAT8, std col. size: 8, actual col. size: 9, total size: 36864, 
data size: 36000, row capacity: 4096, density: 98)
  T10¦¦missing13(type: VARCHAR, std col. size: 54, actual col. size: 22, total 
size: 4521984, data size: 84004, row capacity: 65535, density: 2)
  T10¦¦missing14(type: VARCHAR, std col. size: 54, actual col. size: 22, total 
size: 4521984, data size: 84004, row capacity: 65535, density: 2)
  T10¦¦missing15(type: VARCHAR, std col. size: 54, actual col. size: 38, total 
size: 4521984, data size: 148004, row capacity: 65535, density: 4)
  T10¦¦missing16(type: VARCHAR, std col. size: 54, actual col. size: 38, total 
size: 4521984, data size: 148004, row capacity: 65535, density: 4)
  T10¦¦missing17(type: VARCHAR, std col. size: 54, actual col. size: 10, total 
size: 4521984, data size: 36004, row capacity: 65535, density: 1)
  T10¦¦missing18(type: VARCHAR, std col. size: 54, actual col. size: 11, total 
size: 4521984, data size: 40004, row capacity: 65535, density: 1)
  T10¦¦missing19(type: VARCHAR, std col. size: 54, actual col. size: 39, total 
size: 4521984, data size: 152004, row capacity: 65535, density: 4)
  T10¦¦missing20(type: INT, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing21(type: INT, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing22(type: INT, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing23(type: VARBINARY, std col. size: 54, actual col. size: 10, 
total size: 4521984, data size: 36004, row capacity: 65535, density: 1)
  T10¦¦missing24(type: VARCHAR, std col. size: 54, actual col. size: 15, total 
size: 4521984, data size: 56004, row capacity: 65535, density: 2)
  T10¦¦missing25(type: VARCHAR, std col. size: 54, actual col. size: 17, total 
size: 4521984, data size: 64004, row capacity: 65535, density: 2)
  T10¦¦missing26(type: TIME, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing27(type: TIMESTAMP, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing28(type: DATE, std col. size: 8, actual col. size: 9, total size: 
589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing29(type: BIGINT, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing30(type: DATE, std col. size: 8, actual col. size: 9, total size: 
589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing31(type: FLOAT8, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing32(type: TIME, std col. size: 4, actual col. size: 5, total size: 
327680, data size: 20000, row capacity: 65536, density: 7)
  T10¦¦missing33(type: TIMESTAMP, std col. size: 8, actual col. size: 9, total 
size: 589824, data size: 36000, row capacity: 65536, density: 7)
  T10¦¦missing34(type: VARCHAR, std col. size: 54, actual col. size: 49, total 
size: 4521984, data size: 192004, row capacity: 65535, density: 5)
  T10¦¦m1(type: INT, std col. size: 4, actual col. size: 5, total size: 327680, 
data size: 20000, row capacity: 65536, density: 7)
  EXPR$1(type: BIGINT, std col. size: 8, actual col. size: 9, total size: 
36864, data size: 36000, row capacity: 4096, density: 98)
  Records: 4000, Total size: 75870208, Gross row width:18969, Net row 
width:574, Density:22}
{code}
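
For anyone reading the numbers above: the per-column density values look consistent with data size divided by total size, expressed as a percentage and rounded up. That relationship is inferred from the logged figures only, not taken from the Drill source, and the function name below is hypothetical. A minimal sketch that reproduces a few of the logged values:

```python
def density_percent(data_size: int, total_size: int) -> int:
    """Percentage of an allocated vector that holds data, rounded up.

    Assumption: this ceiling formula is inferred from the log output above;
    it is not copied from Drill's implementation.
    """
    if total_size <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding surprises.
    return (data_size * 100 + total_size - 1) // total_size


# (column, data size, total size, logged density), copied from the log above.
samples = [
    ("mapid", 56004, 73728, 76),
    ("a", 126004, 196608, 65),
    ("T10¦¦missing2", 40004, 4521984, 1),
    ("x", 36000, 36864, 98),
]

for name, data_size, total_size, logged in samples:
    computed = density_percent(data_size, total_size)
    print(f"{name}: computed={computed} logged={logged}")
```

The low-density columns are mostly the VARCHAR ones, where about 4.5 MB is allocated for well under 200 KB of data.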

> Parquet reader generating low-density batches causing Sort operator to spill 
> un-necessarily
> -------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5472
>                 URL: https://issues.apache.org/jira/browse/DRILL-5472
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
>
>
> git.commit.id.abbrev=1e0a14c
> The parquet file used in the query below is ~20 MB; its uncompressed size is 
> ~1.2 GB. The query has a sort that is given ~6 GB of memory for a 
> single fragment, and yet it spills.
> {code}
> select * from (select * from 
> dfs.`/drill/testdata/resource-manager/all_types_large` s order by 
> s.missing12.x) d where d.missing3 is false;
> {code}
> The profile indicates that the above query spilled twice. The profile and 
> the logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
