Thank you Micah!
I spent a bit of time trying to get to the bottom of it (I know Parquet pretty 
well, but I’m not that familiar with the inner workings of Arrow’s Parquet 
code), so if I manage to track down the issue I’ll circle back. (I give myself 
a 30% chance of success given the allotted time and my expertise level.)
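
As a reference point for the two-row repro table quoted below, here is a minimal sketch of the definition levels a Parquet writer would be expected to produce for the leaf column struct.int, assuming both the struct and its int child are optional (so the maximum definition level is 2). The Row and LeafColumn types and the ShredStructInt function are hypothetical stand-ins for illustration; they are not part of the Arrow or Parquet C++ API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one row of struct<int: int64> (not an Arrow type).
struct Row {
  bool struct_valid;  // false: the struct itself is null
  bool int_valid;     // false: the "int" child is null (only read if struct_valid)
  int64_t value;      // only meaningful when both flags are true
};

// Hypothetical container for what gets written for the leaf column struct.int.
struct LeafColumn {
  std::vector<int16_t> def_levels;  // one entry per row
  std::vector<int64_t> values;      // only the non-null leaf values
};

// Assumes an optional struct containing an optional int64, no repeated fields:
//   def level 0: the struct is null
//   def level 1: the struct is present, the int is null
//   def level 2: both are present, so a value is stored
LeafColumn ShredStructInt(const std::vector<Row>& rows) {
  LeafColumn out;
  for (const Row& row : rows) {
    if (!row.struct_valid) {
      out.def_levels.push_back(0);
    } else if (!row.int_valid) {
      out.def_levels.push_back(1);
    } else {
      out.def_levels.push_back(2);
      out.values.push_back(row.value);
    }
  }
  // Sanity check: there can never be more stored values than rows.
  assert(out.values.size() <= out.def_levels.size());
  return out;
}
```

For the repro rows (a null struct, then a struct with int = 2) this yields definition levels 0 and 2 and a single stored value, 2.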

> On Jul 30, 2020, at 12:31 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
> I created https://issues.apache.org/jira/browse/ARROW-9598 to track.
> 
> On Wed, Jul 29, 2020 at 9:13 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
>> So I think the problem is within WriteLevelSpaced [1]. Specifically, how we
>> calculate "min_spaced_def_level" seems incorrect (I think this only worked
>> for singly nested lists).  This value probably needs to be calculated by
>> walking up the tree to find the def level of the first repeated value.
>> 
>> [1]
>> https://github.com/apache/arrow/blob/3586292d62c8c348e9fb85676eb524cde53179cf/cpp/src/parquet/column_writer.cc#L1141
>> 
>> On Wed, Jul 29, 2020 at 8:01 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> 
>>> Hi Radu,
>>> This appears to be a bug. Would you mind filing an issue in JIRA?
>>> 
>>> I'm looking into it to see if I can figure out what is going on.
>>> 
>>> Thanks,
>>> Micah
>>> 
>>> On Wed, Jul 29, 2020 at 1:07 PM Radu Teodorescu
>>> <radukay...@yahoo.com.invalid> wrote:
>>> 
>>>> Is the current version supposed to allow struct columns with null values
>>>> to be written to Parquet?
>>>> 
>>>> I narrowed it down to a table with one column and two rows. The
>>>> resulting Parquet file is broken according to both parquet-tools and
>>>> our own reader (it looks like a buffer is not written in full, but
>>>> I haven’t dug much deeper).
>>>> 
>>>> This is the table:
>>>> 
>>>> struct: struct<int: int64>
>>>>  child 0, int: int64
>>>> ----
>>>> struct:
>>>>  [
>>>>    -- is_valid:
>>>>          [
>>>>        false,
>>>>        true
>>>>      ]
>>>>    -- child 0 type: int64
>>>>      [
>>>>        null,
>>>>        2
>>>>      ]
>>>>  ]
>>>> 
>>>> and this is my repro table generation:
>>>> 
>>>> std::shared_ptr<arrow::Table> generate_table2() {
>>>>    auto i64builder = std::make_shared<arrow::Int64Builder>();
>>>>    const std::shared_ptr<arrow::DataType> structType =
>>>>        arrow::struct_({arrow::field("int", arrow::int64())});
>>>>    arrow::StructBuilder structBuilder(
>>>>        structType, arrow::default_memory_pool(),
>>>>        {std::static_pointer_cast<arrow::ArrayBuilder>(i64builder)});
>>>>    // Row 0: null struct.
>>>>    PARQUET_THROW_NOT_OK(structBuilder.AppendNull());
>>>>    // Row 1: valid struct with int = 2.
>>>>    PARQUET_THROW_NOT_OK(structBuilder.Append());
>>>>    PARQUET_THROW_NOT_OK(i64builder->Append(2));
>>>>    std::shared_ptr<arrow::Array> structArray;
>>>>    PARQUET_THROW_NOT_OK(structBuilder.Finish(&structArray));
>>>>    std::shared_ptr<arrow::Schema> schema =
>>>>        arrow::schema({arrow::field("struct", structType)});
>>>>    return arrow::Table::Make(schema, {structArray});
>>>> }
>>>> Is this a bug, a known limitation, or am I doing something dumb?
>>>> 
>>>> Thank you
>>>> Radu
>>>> 
>>>> 
