[jira] [Commented] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

2020-11-15 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232503#comment-17232503
 ] 

Micah Kornfield commented on PARQUET-1936:
--

[~Ruta Dhaneshwar] sure, do you maybe want to make a PR to clarify?

> WriteBatchSpaced writes incorrect value for parquet when input contains NULL 
> list
> -
>
> Key: PARQUET-1936
> URL: https://issues.apache.org/jira/browse/PARQUET-1936
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Ruta Dhaneshwar
>Priority: Major
> Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL 
> list 4.png, schema.png
>
>
> When trying to write a column of parquet lists, if there is a NULL list, 
> WriteBatchSpaced will either throw an error (case 1 below) or incorrectly 
> write the last value in the last list as the first value from the first list 
> (case 2 below).
> *!schema.png|width=235,height=106!*
> *CASE 1*
>  Data (3 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>   
>  Parameters to 
> TypedColumnWriter>::WriteBatchSpaced:
>  # num_values: 3
>  # def_levels: [3, 0, 3]
>  # rep_levels: [0, 0, 0]
>  # valid_bits: 0x05 (bit representation 101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two"]
> When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, 
> valid_bits_offset, values), I get an error when running 
> [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools]
>  on the outputted parquet file:
> !NULL list 1.png|width=332,height=52!
> !NULL list 2.png|width=757,height=249!
> Additionally, if I add another list into the data that I write, then the last 
> element of that additional list is incorrectly written as the first element 
> of the first list. See below.
>   
>  *CASE 2*
>  Data (4 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>  [
>     "three",
>     "four"
>  ]
>   
>  Parameters to 
> TypedColumnWriter>::WriteBatchSpaced:
>  # num_values: 5
>  # def_levels: [3, 0, 3, 3, 3]
>  # rep_levels: [0, 0, 0, 0, 1]
>  # valid_bits: 0x1D (decimal 29, bit representation 11101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two", "three", "four"]
> Outputted Parquet File: 
> !NULL list 3.png|width=72,height=145!
> !NULL list 4.png|width=237,height=76!
>   
>  Here we see that the "four" in the last list actually shows up as "one". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

2020-10-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223995#comment-17223995
 ] 

Micah Kornfield commented on PARQUET-1936:
--

[~Ruta Dhaneshwar] part of this might be related to PARQUET-1935; however, 
part of this issue might be incorrect usage of the API.  In both cases, 
the "values" array should not contain any elements corresponding to empty or 
null lists (and neither should the bitmap).
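
To illustrate the point about the "values" array, here is a minimal, 
library-free sketch of how the inputs could be flattened. It does not call 
the Parquet API; SpacedBatch and FlattenLists are hypothetical names, the 
values are plain std::string rather than ByteArray, and the schema (optional 
list, repeated group, optional element, so a maximum definition level of 3) is 
an assumption based on the definition levels in the report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container for the arguments a WriteBatchSpaced-style call takes.
struct SpacedBatch {
  std::vector<int16_t> def_levels;  // one entry per level slot
  std::vector<int16_t> rep_levels;  // one entry per level slot
  std::vector<std::string> values;  // one slot per leaf entry only
  std::vector<uint8_t> valid_bits;  // LSB-first bitmap over `values`
};

using Element = std::optional<std::string>;
using List = std::optional<std::vector<Element>>;

// Assumed schema: optional list -> repeated group -> optional element, giving
// definition levels 0 (null list), 1 (empty list), 2 (null element) and
// 3 (present element).
SpacedBatch FlattenLists(const std::vector<List>& rows) {
  SpacedBatch out;
  for (const List& row : rows) {
    if (!row.has_value() || row->empty()) {
      // Null or empty list: emit levels only, no value slot and no bitmap bit.
      out.def_levels.push_back(row.has_value() ? 1 : 0);
      out.rep_levels.push_back(0);
      continue;
    }
    for (std::size_t i = 0; i < row->size(); ++i) {
      const Element& elem = (*row)[i];
      out.def_levels.push_back(elem.has_value() ? 3 : 2);
      out.rep_levels.push_back(i == 0 ? 0 : 1);
      // Every leaf entry gets a value slot; a null element keeps its slot
      // but leaves the corresponding validity bit unset.
      const std::size_t slot = out.values.size();
      if (slot % 8 == 0) out.valid_bits.push_back(0);
      if (elem.has_value()) {
        out.valid_bits[slot / 8] |= static_cast<uint8_t>(1u << (slot % 8));
      }
      out.values.push_back(elem.value_or(""));
    }
  }
  return out;
}
```

For the four lists in case 2 this produces five definition and repetition 
levels but only four value slots ("one", "two", "three", "four") with a 
bitmap of 0x0F, rather than a five-entry array with a nullptr placeholder for 
the null list.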

 

Thank you for the bug report. It is generally most helpful if you provide 
minimal code that can reproduce the issue instead of pseudo-code.



