westonpace commented on issue #20385:
URL: https://github.com/apache/arrow/issues/20385#issuecomment-1380541325
So I've found the problem but haven't yet worked out the solution. The
segmentation fault occurs in `GetReader` in `reader.cc`. The handling for
extension types is:
```
if (type_id == ::arrow::Type::EXTENSION) {
  auto storage_field = arrow_field->WithType(
      checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type());
  RETURN_NOT_OK(GetReader(field, storage_field, ctx, out));
  *out = std::make_unique<ExtensionReader>(arrow_field, std::move(*out));
  return Status::OK();
}
```
However, if the nested field is not loaded, then the recursive `GetReader`
call sets `out` to `nullptr` and this code creates an `ExtensionReader` with a
null storage reader. This later crashes.
The fix is, unfortunately, not as simple as returning null. The problem is
that the Parquet reader tries to maintain the nested structure. As you can
see in your working example, `column.one` yields a partial struct:
```
assert table2.to_pylist() == [
    {"column": {"one": 10}},
    {"column": {"one": 20}},
    {"column": {"one": 30}},
]
```
However, it is not clear that a partial "extension type" is a valid thing.
For example, imagine your extension type was a 2DPoint with "x" and "y". What
should be returned if the user loads `points.x`? We can't maintain structure
in that case.
I'm a little new to Parquet and nested references, so I don't know if there
is a syntax we can use to ask for the nested columns without structure. With
such a syntax you would get:
```
assert table2.to_pylist() == [
    {"one": 10},
    {"one": 20},
    {"one": 30},
]
```
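To make the difference between the two shapes above concrete, here is a small pure-Python sketch. It does not use pyarrow; `project` and its `keep_structure` flag are hypothetical names made up for illustration, and the `two` field is invented sample data.

```python
def project(rows, path, keep_structure=True):
    """Select one nested field, addressed by a dotted path, from each row.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    keys = path.split(".")
    result = []
    for row in rows:
        # Walk down the nested dicts to reach the leaf value.
        value = row
        for key in keys:
            value = value[key]
        if keep_structure:
            # Rebuild the enclosing dicts around the leaf (a partial struct).
            for key in reversed(keys):
                value = {key: value}
            result.append(value)
        else:
            # Drop the parents and keep only the leaf name.
            result.append({keys[-1]: value})
    return result


# Invented sample rows; only "one" appears in the original example.
rows = [
    {"column": {"one": 10, "two": "a"}},
    {"column": {"one": 20, "two": "b"}},
    {"column": {"one": 30, "two": "c"}},
]

partial = project(rows, "column.one")                      # structure kept
flat = project(rows, "column.one", keep_structure=False)   # structure dropped
```

For a plain struct both results are well defined. For an extension type, the structure-preserving result would have to be a partial extension value, which is the case that may not be valid.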
I will put together a PR that at least returns an invalid status in this
case instead of a segmentation fault.
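As a rough illustration of that guard, here is a Python model of the control flow. It is not the Arrow C++ code: `get_reader`, `InvalidStatus`, and the dict-based field layout are all hypothetical stand-ins, and the real change may look different.

```python
class InvalidStatus(Exception):
    """Hypothetical stand-in for a non-OK status from the C++ reader."""


def get_reader(field, selected):
    """Describe a reader for `field`, or return None if nothing under it is selected."""
    kind = field["kind"]
    if kind == "leaf":
        if field["name"] in selected:
            return {"reader": "leaf", "name": field["name"]}
        return None
    if kind == "struct":
        children = [get_reader(child, selected) for child in field["children"]]
        children = [child for child in children if child is not None]
        if not children:
            return None  # every child was pruned
        return {"reader": "struct", "name": field["name"], "children": children}
    if kind == "extension":
        storage = get_reader(field["storage"], selected)
        if storage is None:
            # The guard: refuse to wrap a missing storage reader.
            raise InvalidStatus(
                f"extension field {field['name']!r} has no storage reader"
            )
        return {"reader": "extension", "name": field["name"], "storage": storage}
    raise ValueError(f"unknown field kind: {kind!r}")


# Hypothetical 2D point extension type stored as a struct of x and y.
point = {
    "kind": "extension",
    "name": "points",
    "storage": {
        "kind": "struct",
        "name": "points",
        "children": [
            {"kind": "leaf", "name": "points.x"},
            {"kind": "leaf", "name": "points.y"},
        ],
    },
}

reader = get_reader(point, {"points.x", "points.y"})
try:
    get_reader(point, {"other.column"})
    error = None
except InvalidStatus as exc:
    error = str(exc)
```

This model only covers the case where the storage reader is missing entirely. It does not settle what should happen when only `points.x` is selected.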