svilupp commented on issue #403:
URL: https://github.com/apache/arrow-julia/issues/403#issuecomment-1465275189

   I think I know where it's coming from.
   
   The issue happens 
[here](https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/write.jl#L196)
   - Only the first partition is scanned to determine the schema
   - Unfortunately, the partition of DataFrameRows loses the parent schema when 
pushed through Tables.columns
   - It does however keep the reference to the parent (and its schema)
   
   In other words, we do `partition |> Tables.columns |> Tables.schema`, which 
loses the missingness.
   
   I don't know enough about the Tables API/contract to know whether this is an 
Arrow problem, Tables problem, or DataFrames problem. Does this issue belong 
somewhere else?
   
   It would be an easy fix to get schema info from the parent object, but are 
all Tables-compatible sources required to keep that?
   
   E.g.,
   - change from `partition |> Tables.columns |> Tables.schema` 
   - to `partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> 
Tables.schema`
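
   A minimal sketch of that idea, for illustration only. The helper name 
`schema_with_parent_fallback` is hypothetical and not part of Arrow.jl or 
Tables.jl. It assumes the object returned by `Tables.columns` may carry a 
`parent` field, and it falls back to the regular schema lookup when it does 
not, since the Tables interface does not require sources to keep a parent:

   ```julia
   using Tables

   # Hypothetical helper: prefer the schema of the `parent` field (if any),
   # otherwise use the schema of the partition's own columns.
   # Caveat: a NamedTuple column table with a column literally named `parent`
   # would also match the `hasfield` check, so this is only a sketch.
   function schema_with_parent_fallback(partition)
       cols = Tables.columns(partition)
       src = hasfield(typeof(cols), :parent) ? getfield(cols, :parent) : cols
       return Tables.schema(src)
   end

   # Usage on a plain row table, where no `parent` field is expected
   # and the fallback branch is taken.
   rows = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4))
   first_partition = first(Iterators.partition(rows, 2))
   sch = schema_with_parent_fallback(first_partition)
   @info "Schema of first partition" sch
   ```

   Whether relying on a `parent` field is acceptable is exactly the open 
question above, so the fallback branch is what keeps the sketch usable for 
sources that do not expose one.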
   
   Illustration
   ```
   # correct when working with Tables object
   df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
   for r in Iterators.partition(Tables.rows(df), 2)
       @info "Parent type: $(r.parent.x1|>eltype)"
       @info "Columns type: $(Tables.columns(r)|>Tables.schema)"
   end
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  String
     └  :x2  Int64
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  Union{Missing, String}
     └  :x2  Int64
   
   # incorrect when working with DataFrame
   df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
   for r in Iterators.partition(Tables.rows(df), 2)
       @info "Parent type: $(r.parent.x1|>eltype)"
       @info "Columns type: $(Tables.columns(r)|>Tables.schema)"
   end
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  String
     └  :x2  Int64
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  Union{Missing, String}
     └  :x2  Int64
   ```
   
   EDIT: I suspect this will affect other partitioners that rely on iterators 
over `Tables.rows()`, e.g., `TableOperations.makepartition()`.

