Robert Hou created DRILL-6276:
---------------------------------
Summary: Drill CTAS creates parquet file having page greater than 200 MB.
Key: DRILL-6276
URL: https://issues.apache.org/jira/browse/DRILL-6276
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.13.0
Reporter: Robert Hou
Attachments: alltypes_asc_16MB.json
I used this CTAS to create a Parquet file from a JSON file:
{noformat}
create table `alltypes.parquet` as select
  cast(BigIntValue as BigInt) BigIntValue,
  cast(BooleanValue as Boolean) BooleanValue,
  cast(DateValue as Date) DateValue,
  cast(FloatValue as Float) FloatValue,
  cast(DoubleValue as Double) DoubleValue,
  cast(IntegerValue as Integer) IntegerValue,
  cast(TimeValue as Time) TimeValue,
  cast(TimestampValue as Timestamp) TimestampValue,
  cast(IntervalYearValue as INTERVAL YEAR) IntervalYearValue,
  cast(IntervalDayValue as INTERVAL DAY) IntervalDayValue,
  cast(IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue,
  cast(BinaryValue as binary) Binaryvalue,
  cast(VarcharValue as varchar) VarcharValue
from `alltypes.json`;
{noformat}
I ran parquet-tools/parquet-dump on the resulting file:
{noformat}
VarcharValue TV=6885 RL=0 DL=1
------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
{noformat}
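The page size is about 16 MB (17,240,317 bytes). This is with a 16 MB data set,
so all 6885 values of the column end up in a single page. When I try a similar
1 GB data set, the first page is over 200 MB (215,243,750 bytes), and later
pages shrink until they level off at about 1 MB.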
{noformat}
VarcharValue TV=208513 RL=0 DL=1
------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
page 3: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
page 4: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
page 5: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
page 6: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
page 7: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
page 8: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
page 9: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
page 10: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
page 11: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
page 12: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
page 13: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
page 14: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
page 15: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
{noformat}
The column is a varchar whose values vary in length from 2 bytes to 5000 bytes.
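
To make the page-size pattern in the 1 GB dump easier to read, here is a rough Python sketch that parses the page lines and reports each page's size in MiB, its average value length, and how many times it exceeds a target page size. The 1 MiB target is an assumption suggested by the sizes of the last pages, not something confirmed in this report, and the helper names (parse_pages, summarize) are made up for illustration.

```python
import re

MIB = 1024 * 1024
# Assumption: the writer is aiming for roughly 1 MiB pages, as the tail
# pages in the dump suggest. Adjust if the configured page size differs.
ASSUMED_TARGET_BYTES = 1 * MIB

# Matches lines such as:
#   page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
PAGE_RE = re.compile(r"page\s+(\d+):.*?SZ:(\d+)\s+VC:(\d+)")

# Page lines copied from the parquet-dump output for the 1 GB data set.
DUMP = """\
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
page 3: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
page 4: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
page 5: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
page 6: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
page 7: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
page 8: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
page 9: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
page 10: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
page 11: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
page 12: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
page 13: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
page 14: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
page 15: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
"""


def parse_pages(dump_text):
    """Return a list of (page_index, size_bytes, value_count) tuples."""
    return [tuple(int(g) for g in m.groups())
            for m in PAGE_RE.finditer(dump_text)]


def summarize(pages, target_bytes=ASSUMED_TARGET_BYTES):
    """Compute per-page size in MiB, average value length, and size/target."""
    return [
        {
            "page": idx,
            "mib": size / MIB,
            "avg_value_bytes": size / count,
            "times_target": size / target_bytes,
        }
        for idx, size, count in pages
    ]


if __name__ == "__main__":
    for row in summarize(parse_pages(DUMP)):
        print("page {page:2d}: {mib:8.2f} MiB, "
              "avg value {avg_value_bytes:7.1f} B, "
              "{times_target:6.1f}x assumed target".format(**row))
```

Run against the dump above, this should show the value count roughly halving from page to page over the first several pages while the page size stays well above the assumed target, and only settling near it from about page 8 onward.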