jorisvandenbossche commented on PR #36846:
URL: https://github.com/apache/arrow/pull/36846#issuecomment-1741172891

   @Fokko thanks for the updates!
   
   Going throught my previous reviews, summarizing here:
   
   Some follow-up ideas that I don't think need to be addressed here:
   - Mismatching timezones: a potential option could be to actually allow 
promoting timezones, and with the result using UTC (the actual time points will 
stay the same since we store UTC under the hood, you only loose the information 
about the specified timezone)
   - Promoting bools to numeric?
   - Exposing the fine-grained options in python (if we actually want to do 
this?)
   - We have some logic around promoting types in kernels (`DispatchBest`, 
implemented in `CommonNumeric` for numeric types, added in #9294). We should 
look into consolidating that logic (the kernels can use the "permissive" 
promotion).
   
   
   Then there are a few things that I would prefer to resolve here in this PR:
   
   - Do we need the `pyarrow.FieldMergeOptions` class? At the moment it 
actually doesn't really do anything, since we don't (yet) allow to tweak 
individual flags, only "permissive"/"default" (and the class itself actually 
also only exposes "permissive", not "default", as a static method to create 
that options object). Maybe we can just remove it for now, we can always add it 
back later if we want to expose individual options.  
     I just noticed that the repr is also failing, so if we keep it, we should 
fix and test that. 
   
   
   - For `concat_tables`, I find it confusing that a user still has to specify 
`promote=True` when they already specified `field_merge_options="permissive"` 
(because the latter only makes sense with promote=True). Essentially the 
`promote=False` can be seen as a "no promotion" field merge option?  
     I think with this new promotion capability, the historically added 
`promote` keyword is unfortunately badly chosen. Essentially it means: "unify 
schemas" (that's also how it is called on the C++ side), and by default it 
hardly does any promotion (only non-existing fields or null fields to any 
type). If we keep `promote` keyword, I would expect that `promote=True` would 
do actual, permissive promotion. 
     
     To be clear, this is not caused by this PR, but I would still prefer to 
get a nicer API. One option is to actually rename promote to `unify_schemas` 
(which is what it does). We can still keep `promote` as an alias for now for 
backwards compatibility. 
     
     And personally, I am also wondering if we shouldn't make "permissive" the 
default, when schema unification is enabled (in which case the "default" merge 
option is badly named ..). But that would actually be a change in behaviour.
   
   - The keyword names: `pa.concat_tables` uses `field_merge_options`, 
`pa.unify_schemas` uses `options`. Should we try to align those names? I 
mentioned that I would like `promote_options`, which I think is more 
descriptive, but on the other hand deviates from the C++ termininology ..
   
   The last two bullet points are more about the final Python API, and doesn't 
impact the large part of the changes in the PR (the actual C++ changes), so 
maybe I shouldn't block the PR on those, and I am also fine with discussing 
those in a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to