aboderinsamuel commented on PR #50145: URL: https://github.com/apache/arrow/pull/50145#issuecomment-4685480173
Thanks @AlenkaF! Happy to dig in, agreed these are better hashed out up front. My take, keeping this PR scoped to FixedShapeTensor and treating the rest as follow-ups: 1. Other canonical extension types I think the same idea fits them, but the right pandas dtype isn't always ArrowDtype. pd.ArrowDtype(self) is a good generic default (faithful, round-trips, no extra plumbing), yet some types have a more natural mapping worth weighing case by case, e.g. bool8 → a boolean dtype, while json/uuid/opaque could stay ArrowDtype. So I'd lean toward implementing it per canonical type (defaulting to ArrowDtype, overriding where a native dtype is clearly better) rather than one blanket implementation on BaseExtensionType. That keeps each behavior change small and reviewable, and avoids silently changing to_pandas for user-defined extension types — which feels more like the territory of the as_py fallback in [#33134](https://github.com/aboderinsamuel/arrow/issues/33134). 2. Implications for to_pandas Returning a dtype with __from_arrow__ does change to_pandas/Table.to_pandas for that type: instead of the current fallback (storage → object/numpy; e.g. tensors become an object column of flattened ndarrays), you get a faithful extension-typed column. What I see: ✅ round-trips: to_pandas → from_pandas now preserves the extension type (object dtype loses it). ✅ split_blocks=True works instead of raising KeyError (the original bug). ⚠️ user-facing change: anyone relying on the old object/numpy output now gets ArrowDtype — needs a changelog note. types_mapper still takes precedence (it's checked before to_pandas_dtype in both _get_extension_dtypes and _array_like_to_pandas), so explicit user mappings are unaffected. Only engages on pandas ≥ 2.1 (reliable ArrowDtype extension blocks, GH-35821); older pandas keeps today's fallback. Per-type rollout + the version gate keep the blast radius controlled. 3. Docstrings Agreed, worth doing. BaseExtensionType and its subclasses currently inherit to_pandas_dtype (and friends) from DataType with no mention of extension behavior. I'd be glad to take a follow-up documenting the to_pandas_dtype/to_pandas behavior on BaseExtensionType and per-type specifics. If it helps, I can open a tracking issue capturing this broader direction (canonical-type mappings + docstring cleanup) so it isn't buried here, and keep this PR focused on FixedShapeTensor. Does that split sound right to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
