[ https://issues.apache.org/jira/browse/DRILL-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Hou reassigned DRILL-6276:
---------------------------------
Assignee: Pritesh Maker
> Drill CTAS creates parquet file having page greater than 200 MB.
> ----------------------------------------------------------------
>
> Key: DRILL-6276
> URL: https://issues.apache.org/jira/browse/DRILL-6276
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.13.0
> Reporter: Robert Hou
> Assignee: Pritesh Maker
> Priority: Major
> Attachments: alltypes_asc_16MB.json
>
>
> I used this CTAS to create a parquet file from a json file:
> {noformat}
> create table `alltypes.parquet` as
> select cast(BigIntValue as BigInt) BigIntValue,
>        cast(BooleanValue as Boolean) BooleanValue,
>        cast(DateValue as Date) DateValue,
>        cast(FloatValue as Float) FloatValue,
>        cast(DoubleValue as Double) DoubleValue,
>        cast(IntegerValue as Integer) IntegerValue,
>        cast(TimeValue as Time) TimeValue,
>        cast(TimestampValue as Timestamp) TimestampValue,
>        cast(IntervalYearValue as INTERVAL YEAR) IntervalYearValue,
>        cast(IntervalDayValue as INTERVAL DAY) IntervalDayValue,
>        cast(IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue,
>        cast(BinaryValue as binary) Binaryvalue,
>        cast(VarcharValue as varchar) VarcharValue
> from `alltypes.json`;
> {noformat}
> I ran parquet-tools/parquet-dump on the resulting file:
> {noformat}
> VarcharValue TV=6885 RL=0 DL=1
>
> ------------------------------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
> {noformat}
> The page size is about 16 MB (17,240,317 bytes) for a 16 MB data set, so all
> 6885 values land in a single page. With a similar 1 GB data set, the first
> page is over 200 MB (215,243,750 bytes), and later pages shrink to about 1 MB.
> {noformat}
> VarcharValue TV=208513 RL=0 DL=1
>
> ------------------------------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
> page 1: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
> page 2: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
> page 3: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
> page 4: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
> page 5: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
> page 6: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
> page 7: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
> page 8: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
> page 9: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
> page 10: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
> page 11: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
> page 12: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
> page 13: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
> page 14: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
> page 15: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
> {noformat}
> The column is a varchar whose values vary in size from 2 bytes to 5000 bytes.
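To make the page-size pattern in the dump easier to read, the page lines can be summarized with a small script. This is only a sketch and is not part of Drill or parquet-tools: the function name `summarize_pages`, the regular expression, and the `SAMPLE` excerpt are assumptions for illustration, and the parser assumes page lines look like the ones quoted above (`page N: ... SZ:<bytes> VC:<value count>`).

```python
import re

# Matches parquet-dump page lines such as:
#   page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
# The pattern is an assumption based on the output quoted in this issue.
PAGE_RE = re.compile(r"page\s+(\d+):.*?\bSZ:(\d+)\s+VC:(\d+)")


def summarize_pages(dump_text):
    """Return (page_index, size_bytes, value_count, avg_bytes_per_value) tuples."""
    rows = []
    for line in dump_text.splitlines():
        m = PAGE_RE.search(line)  # search() tolerates a leading "> " quote prefix
        if m:
            idx, size, count = (int(g) for g in m.groups())
            avg = size / count if count else 0.0
            rows.append((idx, size, count, avg))
    return rows


# A few page lines copied from the 1 GB run reported above.
SAMPLE = """\
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
page 15: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
"""

if __name__ == "__main__":
    for idx, size, count, avg in summarize_pages(SAMPLE):
        print(f"page {idx:2d}: {size / 2**20:7.1f} MiB, "
              f"{count:6d} values, {avg:7.1f} bytes/value")
```

Running it on the full dump shows both the size in MiB and the value count per page side by side. In the output quoted above, the value count roughly halves from page 0 through page 5 (87433, 43717, 21859, 10930, 5466, 2734), and the page size only settles near 1 MB after that.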
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)