Fokko commented on PR #36846: URL: https://github.com/apache/arrow/pull/36846#issuecomment-1744547621
Thanks @jorisvandenbossche for summarizing the pending issues. I was rebasing to see if the LLVM stuff has been fixed. > Do we need the pyarrow.FieldMergeOptions class? At the moment it actually doesn't really do anything, since we don't (yet) allow to tweak individual flags, only "permissive"/"default" (and the class itself actually also only exposes "permissive", not "default", as a static method to create that options object). Maybe we can just remove it for now, we can always add it back later if we want to expose individual options. I don't think it adds much value. Just accepting a string is much clearer to the end-user, and I think the majority of the people don't the fine-grained tuning. > For concat_tables, I find it confusing that a user still has to specify promote=True when they already specified field_merge_options="permissive" (because the latter only makes sense with promote=True). Essentially the promote=False can be seen as a "no promotion" field merge option? I think with this new promotion capability, the historically added promote keyword is unfortunately badly chosen. Essentially it means: "unify schemas" (that's also how it is called on the C++ side), and by default it hardly does any promotion (only non-existing fields or null fields to any type). If we keep promote keyword, I would expect that promote=True would do actual, permissive promotion. To be clear, this is not caused by this PR, but I would still prefer to get a nicer API. One option is to actually rename promote to unify_schemas (which is what it does). We can still keep promote as an alias for now for backwards compatibility. This is a good point, and I agree that `promote` is poorly chosen. You could have the situation that you don't want to add null fields, but you allow the promotion of the fields. I've added the `none` option as you suggested, and I think this is much cleaner. Also added support for `promote` if people still rely on it. It will emit a warning to let users know that they should switch to `promote_options`. > The keyword names: pa.concat_tables uses field_merge_options, pa.unify_schemas uses options. Should we try to align those names? I mentioned that I would like promote_options, which I think is more descriptive, but on the other hand deviates from the C++ terminology .. I like `promote_options` as well. Aligned the names there. > The last two bullet points are more about the final Python API, and doesn't impact the large part of the changes in the PR (the actual C++ changes), so maybe I shouldn't block the PR on those, and I am also fine with discussing those in a follow-up. We should not merge anything that we're going to rip out anyway, but I'm happy to help with some of the follow-ups. Getting this one in would be great since it is quite big already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
