Robert Hou created DRILL-6276:
---------------------------------

             Summary: Drill CTAS creates parquet file having page greater than 
200 MB.
                 Key: DRILL-6276
                 URL: https://issues.apache.org/jira/browse/DRILL-6276
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.13.0
            Reporter: Robert Hou
         Attachments: alltypes_asc_16MB.json

I used this CTAS to create a parquet file from a json file:
{noformat}
create table `alltypes.parquet` as select cast(BigIntValue as BigInt) 
BigIntValue, cast(BooleanValue as Boolean) BooleanValue, cast (DateValue as 
Date) DateValue, cast (FloatValue as Float) FloatValue, cast (DoubleValue as 
Double) DoubleValue, cast (IntegerValue as Integer) IntegerValue, cast 
(TimeValue as Time) TimeValue, cast (TimestampValue as Timestamp) 
TimestampValue, cast (IntervalYearValue as INTERVAL YEAR) IntervalYearValue, 
cast (IntervalDayValue as INTERVAL DAY) IntervalDayValue, cast 
(IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue, cast (BinaryValue 
as binary) Binaryvalue, cast (VarcharValue as varchar) VarcharValue from 
`alltypes.json`;
{noformat}

I ran parquet-tools/parquet-dump :

    VarcharValue TV=6885 RL=0 DL=1
    
------------------------------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885

The page size is 16MB.  This is with a 16MB data set.  When I try a similar 1GB 
data set, the page size starts at over 200 MB, decreasing down to 1MB.

    VarcharValue TV=208513 RL=0 DL=1
    
------------------------------------------------------------------------------------------------
    page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
    page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
    page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
    page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
    page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
    page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
    page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
    page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
    page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
    page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
    page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
    page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
    page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
    page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
    page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
    page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268

The column has a varchar, and the size varies from 2 bytes to 5000 bytes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to