[
https://issues.apache.org/jira/browse/ARROW-18439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648555#comment-17648555
]
&res commented on ARROW-18439:
------------------------------
As a general comment, it is quite easy to create Arrow data that violates the
declared nullability. In the example above I was able to create a table where
the nullability of the fields is not respected.
And this would pass:
{code:python}
table.validate(full=True)
{code}
But this would throw ArrowInvalid:
{code:python}
table.cast(table.schema)
{code}
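For illustration, a minimal self-contained sketch of that discrepancy, using the schema from the issue below (the validate/cast behavior is as reported against pyarrow 10.0.1 and may differ in other versions):

{code:python}
import io

import pyarrow as pa
import pyarrow.parquet as pq

# Reproduction of the schema from the issue: a list of structs whose
# fields are declared non-nullable, but the data contains a null.
struct = pa.struct([pa.field("nested_string", pa.string(), nullable=False)])
schema = pa.schema(
    [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
)
table = pa.table(
    {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]},
    schema=schema,
)

# Per the report, full validation does not check the declared nullability,
# so this passes even though the data violates the schema.
table.validate(full=True)

# Casting the table to its own schema, however, re-checks nullability
# and raises ArrowInvalid (as reported).
try:
    table.cast(table.schema)
except pa.ArrowInvalid as exc:
    print(exc)
{code}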
> Misleading message when loading parquet data with invalid null data
> -------------------------------------------------------------------
>
> Key: ARROW-18439
> URL: https://issues.apache.org/jira/browse/ARROW-18439
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 10.0.1
> Reporter: &res
> Priority: Minor
>
> I'm saving an Arrow table to Parquet. One column is a list of structs whose
> elements are marked as non-nullable, but the data isn't valid because I've
> put a null in one of the nested fields.
> When I save this data to Parquet and try to load it back, I get a very
> misleading message:
> {code:python}
> Length spanned by list offsets (2) larger than values array (length 1){code}
> I would rather Arrow complain when creating the table or when saving it to
> Parquet.
> Here's how to reproduce the issue:
> {code:python}
> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> struct = pa.struct(
>     [
>         pa.field("nested_string", pa.string(), nullable=False),
>     ]
> )
> schema = pa.schema(
>     [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
> )
> table = pa.table(
>     {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]},
>     schema=schema,
> )
> with io.BytesIO() as file:
>     pq.write_table(table, file)
>     file.seek(0)
>     pq.read_table(file)  # Raises pa.ArrowInvalid
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)