[ 
https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225630#comment-17225630
 ] 

Ruta Dhaneshwar commented on PARQUET-1936:
------------------------------------------

[~emkornfield] thanks for your response. Changing the parameters so that 
"values" and the validity bitmap no longer include entries for empty and null 
lists solved the problem. Could this be clarified in the comment for the 
WriteBatchSpaced function? The part about "... but the values include the null 
entries with definition level == (max_definition_level - 1)." was confusing, 
and I don't think empty lists are mentioned at all. Thanks again! 
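
For illustration, here is a minimal C++ sketch of the rule that resolved this, 
as I understand it: only entries with definition level >= 
(max_definition_level - 1) get a slot in "values" and a bit in the bitmap, so 
null and empty lists contribute a definition level but nothing else. 
SpacedInput and BuildSpacedInput are hypothetical names, not part of 
parquet-cpp, and the sketch assumes max_definition_level is 3 for the schema in 
this issue and that the bitmap is least-significant-bit first. It only prepares 
the arguments; it does not call WriteBatchSpaced itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container (not a parquet-cpp type) for the spaced arguments
// of one leaf column.
struct SpacedInput {
  std::vector<std::string> values;  // one slot per level >= max_def_level - 1
  std::vector<uint8_t> valid_bits;  // LSB-first bitmap, one bit per slot
};

// Hypothetical helper: builds "values" and "valid_bits" from the definition
// levels. Assumption: levels below (max_def_level - 1) denote null or empty
// lists and therefore get no slot at all.
SpacedInput BuildSpacedInput(const std::vector<int16_t>& def_levels,
                             const std::vector<std::string>& present_values,
                             int16_t max_def_level) {
  SpacedInput out;
  std::size_t next_present = 0;  // index into present_values
  std::size_t slot = 0;          // index into the spaced values array
  for (int16_t level : def_levels) {
    if (level < max_def_level - 1) {
      continue;  // null or empty list: definition level only, no slot
    }
    if (slot % 8 == 0) {
      out.valid_bits.push_back(0);  // start a new bitmap byte
    }
    if (level == max_def_level) {
      // Present element: copy the value and set its validity bit.
      out.values.push_back(present_values.at(next_present++));
      out.valid_bits[slot / 8] |= static_cast<uint8_t>(1u << (slot % 8));
    } else {
      // Null element (level == max_def_level - 1): placeholder, bit stays 0.
      out.values.emplace_back();
    }
    ++slot;
  }
  return out;
}
```

Under these assumptions, Case 1 from the issue (def_levels [3, 0, 3]) yields 
two value slots, "one" and "two", with valid_bits 0x03, rather than three 
slots with valid_bits 0x05.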

> WriteBatchSpaced writes incorrect value for parquet when input contains NULL 
> list
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-1936
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Ruta Dhaneshwar
>            Priority: Major
>         Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL 
> list 4.png, schema.png
>
>
> When trying to write a column of parquet lists, if there is a NULL list, 
> WriteBatchSpaced will either throw an error (case 1 below) or incorrectly 
> write the last value in the last list as the first value from the first list 
> (case 2 below).
> *!schema.png|width=235,height=106!*
> *CASE 1*
>  Data (3 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>   
>  Parameters to 
> TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 3
>  # def_levels: [3, 0, 3]
>  # rep_levels: [0, 0, 0]
>  # valid_bits: 0x05 (bit representation 101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two"]
> When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, 
> valid_bits_offset, values), I get an error when running 
> [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools]
>  on the output Parquet file:
> !NULL list 1.png|width=332,height=52!
> !NULL list 2.png|width=757,height=249!
> Additionally, if I add another list into the data that I write, then the last 
> element of that additional list is incorrectly written as the first element 
> of the first list. See below.
>   
>  *CASE 2*
>  Data (4 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>  [
>     "three",
>     "four"
>  ]
>   
>  Parameters to 
> TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 5
>  # def_levels: [3, 0, 3, 3, 3]
>  # rep_levels: [0, 0, 0, 0, 1]
>  # valid_bits: 0x1D (bit representation 11101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two", "three", "four"]
> Output Parquet file: 
> !NULL list 3.png|width=72,height=145!
> !NULL list 4.png|width=237,height=76!
>   
>  Here we see that the "four" in the last list actually shows up as "one". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
