jorisvandenbossche commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760801124


   Since both @pitrou and I got a bit confused about this `dictionaries`, we 
should maybe try to further clarify the documentation around it. A more 
explicit test could maybe also help (although I didn't check the C++ tests).  
   (can be a follow-up, since I think needs to get in for the release?)
   
   As I understand it now:
   
   - It's only needed in `ds.dataset(..)` when passing a schema, i.e. which 
creates an actual Partitioning object, and not a PartitioningFactory (which 
will infer the schema (and the dictionary values) from the file paths)
   - In addition, it's only needed to specify it when reading, and not when 
writing with a Partitioning (so can create and use a schema-based Partitioning 
object without specifying the `dictionaries`).  
     This is the "only required when parsing paths" (the docstring says "or an 
error will be raised in parsing"), since we don't need to parse paths when 
writing. But for a user the "parsing paths" == "reading" (in practice) might 
not necessarily be clear.
   
   This behaviour of requiring explicit dictionaries when reading a dataset 
with a Partitioning object with a schema including dictionary fields already 
exists in 1.0 and 2.0 (only without any way to get around the error "No 
dictionary provided for dictionary field", except by letting the partitioning 
be discovered instead of specifying a schema). So that's certainly fine for 
3.0.0 as well. 
    
   But, I am personally still wondering if we can't allow this for reading as 
well to have those dictionaries unspecified but discovered, even when 
specifying an explicit schema (eg it allows to have mixed dictionary / 
non-dictionary partition fields). This actually also worked in pyarrow 0.17.0 
(and I added a test about that in the PR fixing it 
(https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but that 
got apparently lost in a rebase ;)), but I suppose this was changed after 
ensuring that the dictionary-typed partition fields "know" the full dictionary 
of all possible values the dataset 
(https://github.com/apache/arrow/pull/7536#issuecomment-649500017). 
   I can open a JIRA about this to discuss further.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to