jorisvandenbossche commented on pull request #9130: URL: https://github.com/apache/arrow/pull/9130#issuecomment-760801124
Since both @pitrou and I got a bit confused about this `dictionaries`, we should maybe try to further clarify the documentation around it. A more explicit test could maybe also help (although I didn't check the C++ tests). (can be a follow-up, since I think needs to get in for the release?) As I understand it now: - It's only needed in `ds.dataset(..)` when passing a schema, i.e. which creates an actual Partitioning object, and not a PartitioningFactory (which will infer the schema (and the dictionary values) from the file paths) - In addition, it's only needed to specify it when reading, and not when writing with a Partitioning (so can create and use a schema-based Partitioning object without specifying the `dictionaries`). This is the "only required when parsing paths" (the docstring says "or an error will be raised in parsing"), since we don't need to parse paths when writing. But for a user the "parsing paths" == "reading" (in practice) might not necessarily be clear. This behaviour of requiring explicit dictionaries when reading a dataset with a Partitioning object with a schema including dictionary fields already exists in 1.0 and 2.0 (only without any way to get around the error "No dictionary provided for dictionary field", except by letting the partitioning be discovered instead of specifying a schema). So that's certainly fine for 3.0.0 as well. But, I am personally still wondering if we can't allow this for reading as well to have those dictionaries unspecified but discovered, even when specifying an explicit schema (eg it allows to have mixed dictionary / non-dictionary partition fields). This actually also worked in pyarrow 0.17.0 (and I added a test about that in the PR fixing it (https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but that got apparently lost in a rebase ;)), but I suppose this was changed after ensuring that the dictionary-typed partition fields "know" the full dictionary of all possible values the dataset (https://github.com/apache/arrow/pull/7536#issuecomment-649500017). I can open a JIRA about this to discuss further. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org