westonpace commented on issue #20385:
URL: https://github.com/apache/arrow/issues/20385#issuecomment-1380541325

   So I've found the problem but haven't yet worked out the solution.  The 
segmentation fault occurs in `GetReader` in `reader.cc`.  The handling for 
extension types is:
   
   ```
     if (type_id == ::arrow::Type::EXTENSION) {
       auto storage_field = arrow_field->WithType(
           checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type());
       RETURN_NOT_OK(GetReader(field, storage_field, ctx, out));
       *out = std::make_unique<ExtensionReader>(arrow_field, std::move(*out));
       return Status::OK();
     }
   ```
   
   However, if the nested field is not loaded, then the recursive `GetReader` 
call sets `out` to `nullptr` and this code creates an `ExtensionReader` with a 
null storage reader.  The segmentation fault comes later, when that null 
reader is used.
   
   The fix is, unfortunately, not as simple as returning null.  The problem is 
that the Parquet reader is trying to maintain the nested structure.  As you can 
see in your working example, `column.one` yields a partial struct:
   
   ```
   assert table2.to_pylist() == [
       {"column": {"one": 10}},
       {"column": {"one": 20}},
       {"column": {"one": 30}},
   ]
   ```
   
   However, it is not clear that a partial "extension type" is a valid thing.  
For example, imagine your extension type was a 2DPoint with "x" and "y".  What 
should be returned if the user loads `points.x`?  We can't maintain structure 
in that case.
   
   I'm a little new to Parquet and nested references, so I don't know if there 
is a syntax we can use to ask for the nested columns without structure.  With 
such a syntax you would get:
   
   ```
   assert table2.to_pylist() == [
     { "one": 10 },
     { "one": 20 },
     { "one": 30 }
   ]
   ```
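   
   To make the target shape concrete, here is a minimal sketch that produces 
it after the read, for the plain-struct case that already works.  It is only an 
illustration in plain Python: `flatten_rows` is a hypothetical helper, not a 
Parquet or pyarrow feature, and it does not help with the extension type case.

```python
def flatten_rows(rows, parent):
    """Replace each row's `parent` struct with that struct's child fields."""
    flat = []
    for row in rows:
        # Keep every top-level field except the struct being hoisted.
        out = {key: value for key, value in row.items() if key != parent}
        # Hoist the struct's children to the top level.
        out.update(row[parent])
        flat.append(out)
    return flat


# Rows as returned by the partial-struct read shown earlier.
nested = [
    {"column": {"one": 10}},
    {"column": {"one": 20}},
    {"column": {"one": 30}},
]
flat = flatten_rows(nested, "column")
```

   Note that child names can collide with top-level names once the parent is 
dropped, which is one reason a reader might prefer to keep the structure.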
   
   I will put together a PR that at least returns an invalid status in this 
case instead of a segmentation fault.
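   
   The idea of that guard can be sketched as a toy model in Python.  This is 
not Arrow's code: `Invalid`, `ExtensionReader`, `get_storage_reader` and 
`get_extension_reader` are hypothetical stand-ins that only mirror the shape of 
the C++ snippet above.

```python
class Invalid(Exception):
    """Stand-in for a non-OK status (hypothetical, not an Arrow class)."""


class ExtensionReader:
    """Wraps a storage reader (hypothetical stand-in for the C++ class)."""

    def __init__(self, field_name, storage_reader):
        self.field_name = field_name
        self.storage_reader = storage_reader


def get_storage_reader(field_name, selected):
    """Return a reader token, or None when the field is not selected."""
    return f"reader({field_name})" if field_name in selected else None


def get_extension_reader(field_name, selected):
    storage_reader = get_storage_reader(field_name, selected)
    if storage_reader is None:
        # Guard: report an error instead of wrapping a null storage reader.
        raise Invalid(f"extension field {field_name!r} has no storage reader")
    return ExtensionReader(field_name, storage_reader)


# The extension field is selected: wrapping succeeds.
ok = get_extension_reader("ext", selected={"ext"})

# Only a sibling field is selected: the guard raises instead of crashing later.
try:
    get_extension_reader("ext", selected={"one"})
    error = None
except Invalid as exc:
    error = str(exc)
```

   The point is only that the missing storage reader is detected where the 
wrapper is built, so the caller gets an error rather than a later crash.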

