[ https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232503#comment-17232503 ]
Micah Kornfield commented on PARQUET-1936:
------------------------------------------

[~Ruta Dhaneshwar] sure, do you maybe want to make a PR to clarify?

> WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-1936
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Ruta Dhaneshwar
>            Priority: Major
>         Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL list 4.png, schema.png
>
>
> When trying to write a column of parquet lists, if there is a NULL list,
> WriteBatchSpaced will either throw an error (case 1 below) or incorrectly
> write the last value in the last list as the first value from the first list
> (case 2 below).
> *!schema.png|width=235,height=106!*
>
> *CASE 1*
> Data (3 lists):
> [
>   "one"
> ]
> null
> [
>   "two"
> ]
>
> Parameters to
> TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
> # num_values: 3
> # def_levels: [3, 0, 3]
> # rep_levels: [0, 0, 0]
> # valid_bits: 0x05 (bit representation 101)
> # valid_bits_offset: 0
> # values: ["one", nullptr, "two"]
>
> When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits,
> valid_bits_offset, values), I get an error when running
> [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools]
> on the outputted parquet file:
> !NULL list 1.png|width=332,height=52!
> !NULL list 2.png|width=757,height=249!
>
> Additionally, if I add another list into the data that I write, then the last
> element of that additional list is incorrectly written as the first element
> of the first list. See below.
>
> *CASE 2*
> Data (4 lists):
> [
>   "one"
> ]
> null
> [
>   "two"
> ]
> [
>   "three",
>   "four"
> ]
>
> Parameters to
> TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
> # num_values: 5
> # def_levels: [3, 0, 3, 3, 3]
> # rep_levels: [0, 0, 0, 0, 1]
> # valid_bits: 0x1D (decimal 29, bit representation 11101)
> # valid_bits_offset: 0
> # values: ["one", nullptr, "two", "three", "four"]
>
> Outputted Parquet File:
> !NULL list 3.png|width=72,height=145!
> !NULL list 4.png|width=237,height=76!
>
> Here we see that the "four" in the last list actually shows up as "one".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
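For anyone trying to reproduce the report, here is a minimal standalone sketch that derives def_levels, rep_levels and valid_bits from the list data the way the two cases lay them out. It does not call parquet-cpp. The names NullableList, SpacedArgs and BuildArgsAsInReport are hypothetical, the maximum definition level of 3 is inferred from the def_levels in the report (schema.png is not reproduced here), and the bitmap is assumed to be LSB-first. Whether a NULL list should occupy a slot in values at all is the point being discussed, so the sketch only mirrors the reported inputs and takes no position on the correct usage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical input type: one list per row, which may be NULL as a whole.
struct NullableList {
  bool is_null;
  std::vector<std::string> items;
};

// Hypothetical container for the arguments listed in the report.
struct SpacedArgs {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<uint8_t> valid_bits;   // assumed LSB-first, one bit per slot
  std::vector<std::string> values;   // "" stands in for the nullptr slot
};

// Mirrors the layout used in the report: a NULL list gets def level 0 and
// one placeholder slot in `values` whose valid bit is left cleared.
SpacedArgs BuildArgsAsInReport(const std::vector<NullableList>& lists) {
  const int16_t kMaxDefLevel = 3;  // assumption, inferred from the report
  SpacedArgs args;
  std::vector<bool> is_valid;
  for (const NullableList& list : lists) {
    if (list.is_null) {
      args.def_levels.push_back(0);
      args.rep_levels.push_back(0);
      args.values.push_back("");
      is_valid.push_back(false);
      continue;
    }
    // Empty (non-NULL) lists do not appear in the report and are not handled.
    assert(!list.items.empty());
    for (std::size_t i = 0; i < list.items.size(); ++i) {
      args.def_levels.push_back(kMaxDefLevel);
      // 0 starts a new list, 1 continues the current one.
      args.rep_levels.push_back(i == 0 ? 0 : 1);
      args.values.push_back(list.items[i]);
      is_valid.push_back(true);
    }
  }
  // Pack the validity flags into bytes, least significant bit first.
  args.valid_bits.assign((is_valid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < is_valid.size(); ++i) {
    if (is_valid[i]) {
      args.valid_bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return args;
}
```

With the Case 1 data this gives def_levels [3, 0, 3], rep_levels [0, 0, 0] and a bitmap byte of 0x05. With the Case 2 data it gives def_levels [3, 0, 3, 3, 3], rep_levels [0, 0, 0, 0, 1] and a bitmap byte of 0x1D (binary 11101, decimal 29).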