[
https://issues.apache.org/jira/browse/ARROW-18439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648555#comment-17648555
]
&res commented on ARROW-18439:
------------------------------
As a general comment, it is quite easy to create Arrow data that violates the
declared nullability. In the example above I was able to create a table where
the nullability of the fields is not respected.
And this would pass:
{code:python}
table.validate(full=True)
{code}
But this would throw ArrowInvalid:
{code:python}
table.cast(table.schema)
{code}
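For illustration, a minimal self-contained sketch of that discrepancy, using the schema from the issue below (the validate/cast behavior is as reported against pyarrow 10.0.1 and may differ in other versions):

{code:python}
import io

import pyarrow as pa
import pyarrow.parquet as pq

# Reproduction of the schema from the issue: a list of structs whose
# fields are declared non-nullable, but the data contains a null.
struct = pa.struct([pa.field("nested_string", pa.string(), nullable=False)])
schema = pa.schema(
    [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
)
table = pa.table(
    {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]},
    schema=schema,
)

# Per the report, full validation does not check the declared nullability,
# so this passes even though the data violates the schema.
table.validate(full=True)

# Casting the table to its own schema, however, re-checks nullability
# and raises ArrowInvalid (as reported).
try:
    table.cast(table.schema)
except pa.ArrowInvalid as exc:
    print(exc)
{code}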
> Misleading message when loading parquet data with invalid null data
> -------------------------------------------------------------------
>
> Key: ARROW-18439
> URL: https://issues.apache.org/jira/browse/ARROW-18439
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 10.0.1
> Reporter: &res
> Priority: Minor
>
> I'm saving an Arrow table to Parquet. One column is a list of structs whose
> elements are marked as non-nullable, but the data isn't valid because I've
> put a null in one of the nested fields.
> When I save this data to Parquet and try to load it back, I get a very
> misleading message:
> {code:python}
> Length spanned by list offsets (2) larger than values array (length 1){code}
> I would rather Arrow complain when creating the table or when saving it to
> Parquet.
> Here's how to reproduce the issue:
> {code:python}
> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> struct = pa.struct(
>     [
>         pa.field("nested_string", pa.string(), nullable=False),
>     ]
> )
> schema = pa.schema(
>     [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
> )
> table = pa.table(
>     {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]},
>     schema=schema,
> )
> with io.BytesIO() as file:
>     pq.write_table(table, file)
>     file.seek(0)
>     pq.read_table(file)  # Raises pa.ArrowInvalid
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)