svilupp commented on issue #403: URL: https://github.com/apache/arrow-julia/issues/403#issuecomment-1465275189
I think I know where it's coming from.

The issue happens [here](https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/write.jl#L196)
- Only the first partition is scanned to determine the schema
- Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through `Tables.columns`
- It does, however, keep the reference to the parent (and its schema)

In other words, we do `partition |> Tables.columns |> Tables.schema`, which loses the missingness.

I don't know enough about the Tables API/contract to know whether this is an Arrow problem, a Tables problem, or a DataFrames problem. Does this issue belong somewhere else?

It would be an easy fix to get the schema info from the parent object, but are all Tables-compatible sources required to keep that? E.g.,
- change from `partition |> Tables.columns |> Tables.schema`
- to `partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema`

Illustration
```
using DataFrames, Tables

# correct when working with Tables object
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for r in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(r.parent.x1|>eltype)"
    @info "Columns type: $(Tables.columns(r)|>Tables.schema)"
end

[ Info: Parent type: Union{Missing, String}
┌ Info: Columns type: Tables.Schema:
│  :x1  String
└  :x2  Int64
[ Info: Parent type: Union{Missing, String}
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for r in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(r.parent.x1|>eltype)"
    @info "Columns type: $(Tables.columns(r)|>Tables.schema)"
end

[ Info: Parent type: Union{Missing, String}
┌ Info: Columns type: Tables.Schema:
│  :x1  String
└  :x2  Int64
[ Info: Parent type: Union{Missing, String}
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
```

EDIT: I suspect this will affect other
partitioners that rely on iterators over `Tables.rows()`, e.g., `TableOperations.makepartition()`.
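
To make the comparison explicit, here is a minimal sketch that puts the schema of the full table next to the schema inferred from the first partition alone. It assumes DataFrames.jl and Tables.jl are installed. The names `parent_schema`, `first_schema`, and `schemas_agree` are illustrative only and are not part of Arrow.jl. No expected output is listed for the first partition, since that depends on the behaviour reported above.

```julia
using DataFrames, Tables

# Same data as in the illustration: the only `missing` sits in the second partition.
df = DataFrame(x1 = ["a", "b", missing, "c"], x2 = 1:4)

# Schema taken from the full table; element types are read from the columns of `df`.
parent_schema = Tables.schema(df)

# Schema inferred from the first partition only, mirroring
# `partition |> Tables.columns |> Tables.schema` from the comment.
first_part = first(Iterators.partition(Tables.rows(df), 2))
first_schema = Tables.schema(Tables.columns(first_part))

@info "Schema from the full table" parent_schema
@info "Schema from the first partition" first_schema

# Illustrative check (an assumption, not Arrow.jl behaviour):
# do the two schemas report the same column types?
schemas_agree = parent_schema.types == first_schema.types
@info "Schemas agree?" schemas_agree
```

If the two schemas disagree, one option would be to prefer the schema of the parent table whenever the source exposes one. Whether every Tables-compatible source can be relied on to do so is the open question raised above.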
