[ 
https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruta Dhaneshwar updated PARQUET-1936:
-------------------------------------
    Description: 
When writing a column of Parquet lists that contains a NULL list, 
WriteBatchSpaced either produces a file that parquet-tools fails to read 
(case 1 below) or writes the first value of the first list in place of the 
last value of the last list (case 2 below).
  
 Schema:
 message schema {
     optional group _COL_0 (LIST) {
         repeated group list {
             optional binary item (UTF8);
         }
     }
 }
  
 *CASE 1*
 Data (3 lists):
 [
    "one"
 ]
 null
 [
    "two"
 ]
  
 Parameters to 
TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
 # num_values: 3
 # def_levels: [3, 0, 3]
 # rep_levels: [0, 0, 0]
 # valid_bits: 0x05 (bit representation 101)
 # valid_bits_offset: 0
 # values: ["one", nullptr, "two"]
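
The valid_bits value above can be derived from the values array. The following is a minimal sketch, assuming the bitmap is packed least-significant bit first (which matches the 101 pattern listed above). PackValidBits is a hypothetical helper written for illustration, not a parquet-cpp function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): packs one validity flag per
// value slot into a bitmap, least-significant bit first, starting at bit
// `offset` of the first byte (the role played by valid_bits_offset).
std::vector<uint8_t> PackValidBits(const std::vector<bool>& is_valid,
                                   int64_t offset = 0) {
  assert(offset >= 0);
  const int64_t total_bits = offset + static_cast<int64_t>(is_valid.size());
  // Round up to whole bytes; all bits start out as 0 (null).
  std::vector<uint8_t> bitmap(static_cast<std::size_t>((total_bits + 7) / 8), 0);
  for (std::size_t i = 0; i < is_valid.size(); ++i) {
    if (is_valid[i]) {
      const int64_t bit = offset + static_cast<int64_t>(i);
      bitmap[static_cast<std::size_t>(bit / 8)] |=
          static_cast<uint8_t>(1u << (bit % 8));
    }
  }
  return bitmap;
}
```

For case 1, PackValidBits({true, false, true}) yields a single byte, 0x05: slots 0 and 2 are valid and slot 1 (the NULL list) is not.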

When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, 
valid_bits_offset, values) with these parameters and then run 
[parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] 
on the output Parquet file, I get an error:

!NULL list 1.png|width=332,height=52!

!NULL list 2.png|width=757,height=249!

Additionally, if I add another list to the data that I write, the last 
element of that additional list is written incorrectly: it comes out as the 
first element of the first list. See below.
  
 *CASE 2*
 Data (4 lists):
 [
    "one"
 ]
 null
 [
    "two"
 ]
 [
    "three",
    "four"
 ]
  
 Parameters to 
TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
 # num_values: 5
 # def_levels: [3, 0, 3, 3, 3]
 # rep_levels: [0, 0, 0, 0, 1]
 # valid_bits: 0x1D (bit representation 11101)
 # valid_bits_offset: 0
 # values: ["one", nullptr, "two", "three", "four"]
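
The level arrays above follow from the schema: a NULL list has definition level 0, an empty list 1, and a present item 3, while the repetition level is 0 for the first slot of each list and 1 for every later item. Below is a minimal sketch that reproduces them, assuming no NULL items inside a list. ListRow, Levels, and ComputeLevels are hypothetical names, not part of parquet-cpp. The sketch only shows how the arrays in this report were built; it makes no claim about whether WriteBatchSpaced expects a value slot for a NULL list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory row for the schema
//   optional group _COL_0 (LIST) { repeated group list { optional binary item } }
// Simplification: items inside a list are never NULL.
struct ListRow {
  bool is_null;
  std::vector<std::string> items;
};

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

Levels ComputeLevels(const std::vector<ListRow>& rows) {
  Levels out;
  for (const ListRow& row : rows) {
    assert(!row.is_null || row.items.empty());  // a NULL list carries no items
    if (row.is_null) {
      out.def_levels.push_back(0);  // nothing defined, not even the list
      out.rep_levels.push_back(0);
    } else if (row.items.empty()) {
      out.def_levels.push_back(1);  // list defined but has no elements
      out.rep_levels.push_back(0);
    } else {
      for (std::size_t i = 0; i < row.items.size(); ++i) {
        out.def_levels.push_back(3);  // list, element, and item all defined
        out.rep_levels.push_back(static_cast<int16_t>(i == 0 ? 0 : 1));
      }
    }
  }
  return out;
}
```

Feeding the four lists of case 2 to ComputeLevels gives def_levels [3, 0, 3, 3, 3] and rep_levels [0, 0, 0, 0, 1], the same arrays listed above.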

Output Parquet file: 

!NULL list 3.png|width=72,height=145!

!NULL list 4.png|width=237,height=76!
  
 Here the "four" in the last list shows up as "one" in the output. 

  was:
(Same text as the current description, except that the case 2 output was 
referenced as "see NULL list 3 and NULL list 4" instead of embedding the two 
images.)


> WriteBatchSpaced writes incorrect value for parquet when input contains NULL 
> list
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-1936
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Ruta Dhaneshwar
>            Priority: Major
>         Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL 
> list 4.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
