[ 
https://issues.apache.org/jira/browse/DRILL-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou reassigned DRILL-6276:
---------------------------------

    Assignee: Pritesh Maker

> Drill CTAS creates parquet file having page greater than 200 MB.
> ----------------------------------------------------------------
>
>                 Key: DRILL-6276
>                 URL: https://issues.apache.org/jira/browse/DRILL-6276
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.13.0
>            Reporter: Robert Hou
>            Assignee: Pritesh Maker
>            Priority: Major
>         Attachments: alltypes_asc_16MB.json
>
>
> I used this CTAS to create a parquet file from a json file:
> {noformat}
> create table `alltypes.parquet` as select cast(BigIntValue as BigInt) 
> BigIntValue, cast(BooleanValue as Boolean) BooleanValue, cast (DateValue as 
> Date) DateValue, cast (FloatValue as Float) FloatValue, cast (DoubleValue as 
> Double) DoubleValue, cast (IntegerValue as Integer) IntegerValue, cast 
> (TimeValue as Time) TimeValue, cast (TimestampValue as Timestamp) 
> TimestampValue, cast (IntervalYearValue as INTERVAL YEAR) IntervalYearValue, 
> cast (IntervalDayValue as INTERVAL DAY) IntervalDayValue, cast 
> (IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue, cast 
> (BinaryValue as binary) Binaryvalue, cast (VarcharValue as varchar) 
> VarcharValue from `alltypes.json`;
> {noformat}
> I ran parquet-tools/parquet-dump:
>     VarcharValue TV=6885 RL=0 DL=1
>     ------------------------------------------------------------------------------------------------
>     page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
> The page size is about 16 MB (17,240,317 bytes) for a 16 MB data set, so all
> 6885 values of the column ended up in a single page.  With a similar 1 GB data
> set, the first page is over 200 MB and later pages shrink to about 1 MB.
>     VarcharValue TV=208513 RL=0 DL=1
>     ------------------------------------------------------------------------------------------------
>     page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
>     page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
>     page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
>     page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
>     page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
>     page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
>     page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
>     page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
>     page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
>     page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
>     page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
>     page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
>     page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
>     page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
>     page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
>     page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
> The column is a varchar whose values vary in size from 2 bytes to 5000 bytes.
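
For anyone who wants to check their own files the same way, here is a minimal
sketch (not part of the original report) that pulls the page sizes out of dump
output like the above. It assumes the page lines look exactly as quoted, and it
reads SZ as the page size in bytes and VC as the value count, as the report
does. The names parse_pages and oversized_pages and the 1 MiB default
threshold are hypothetical, chosen only for illustration.

```python
import re

# Matches page lines as printed in the dump above, e.g.
#   page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
PAGE_LINE = re.compile(r"page\s+(\d+):.*?\bSZ:(\d+)\s+VC:(\d+)")

MIB = 1024 * 1024


def parse_pages(dump_text):
    """Return a list of (page_index, size_bytes, value_count) tuples."""
    pages = []
    for line in dump_text.splitlines():
        match = PAGE_LINE.search(line)
        if match:
            pages.append(tuple(int(group) for group in match.groups()))
    return pages


def oversized_pages(pages, limit_bytes=MIB):
    """Return the pages whose SZ exceeds limit_bytes (an assumed threshold)."""
    return [page for page in pages if page[1] > limit_bytes]


if __name__ == "__main__":
    # Three of the page lines from the 1 GB data set in the report.
    sample = """
    page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
    page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
    page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
    """
    for index, size, count in parse_pages(sample):
        print(f"page {index}: {size / MIB:.1f} MiB, {count} values, "
              f"{size / count:.0f} bytes/value on average")
```

Passing the full dump text to parse_pages and then calling oversized_pages
with a chosen limit gives a quick list of the pages worth looking at.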



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
