zeroshade commented on issue #34330: URL: https://github.com/apache/arrow/issues/34330#issuecomment-1443883987
@yevgenypats Some easy directions:

* For the most part, nearly all direct arrow<-->parquet handling is kept solely to the `pqarrow` package (*nearly*), so most of the work will be there, with the exception of dealing with reading via `file/record_reader.go`.
* You'll need to properly handle the case where the arrow schema gets stored in `pqarrow/schema.go`, and make sure that the underlying storage type of the extension type matches the physical/logical type of the parquet file.
* The easiest way to handle extension types in *most* cases is to add a check at the top of the type switch for the extension ID; if it's an extension type, set the local type to the underlying storage type. For example, look at `arrow/compute/internal/exec/span.go` at line 170, or look at how the ipc handling deals with extension types.
* Pretty much all arrow writing is handled through the big type switch in `writeDenseArrow` in `pqarrow/encode_arrow.go`, while reading is managed in `column_readers.go`.
* Don't forget to also handle extension types in `path_builder.go`, which is how it determines the leaf arrays for writing.

Honestly, you can probably just search for `arrow.EXTENSION` in any file of `pqarrow/*` and you'll find most, if not all, of the spots that need handling. I tried to explicitly return a not-implemented error for extension types in every case, rather than letting them fall through to the `default`, precisely to make those spots easier to find later.

Don't be afraid to put up a WIP PR if you need some assistance on it, and I'll be happy to help out when I can. Thanks again!! I'm looking forward to the PR.
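The "unwrap to the storage type at the top of the type switch" pattern from the third bullet can be sketched roughly as below. Note this is a self-contained illustration using stand-in types, not the actual arrow-go API; the real code would type-assert against arrow's extension-type interface and call its `StorageType()` method, and `writeColumn` here is a hypothetical name, not a real pqarrow function:

```go
package main

import "fmt"

// DataType is a stand-in for arrow.DataType; the real pqarrow code
// switches on the actual arrow type, not these illustrative structs.
type DataType interface{ Name() string }

type Int32Type struct{}

func (Int32Type) Name() string { return "int32" }

// UUIDExtensionType is a stand-in for a user-defined extension type
// whose underlying storage is a plain physical type.
type UUIDExtensionType struct{ storage DataType }

func (UUIDExtensionType) Name() string            { return "uuid" }
func (t UUIDExtensionType) StorageType() DataType { return t.storage }

// ExtensionType is the stand-in for arrow's extension-type interface.
type ExtensionType interface {
	DataType
	StorageType() DataType
}

// writeColumn mimics the shape of a big pqarrow type switch: check for
// an extension type first, then switch on the (possibly unwrapped) type.
func writeColumn(dt DataType) string {
	// Unwrap extension types to their storage type before switching,
	// so every existing case keeps working unchanged.
	if ext, ok := dt.(ExtensionType); ok {
		dt = ext.StorageType()
	}
	switch dt.(type) {
	case Int32Type:
		return "wrote int32 column"
	default:
		return "not implemented"
	}
}

func main() {
	fmt.Println(writeColumn(Int32Type{}))
	// The extension type takes the same path as its storage type.
	fmt.Println(writeColumn(UUIDExtensionType{storage: Int32Type{}}))
}
```

The point of unwrapping before the switch (rather than adding an extension case to every branch) is that all the existing physical/logical-type cases handle extension-typed columns for free.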
