jorisvandenbossche commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760801124
Since both @pitrou and I got a bit confused about this `dictionaries`, we
should maybe try to further clarify the documentation around it. A more
explicit test could maybe also help (although I didn't check the C++ tests).
(can be a follow-up, since I think needs to get in for the release?)
As I understand it now:
- It's only needed in `ds.dataset(..)` when passing a schema, i.e. which
creates an actual Partitioning object, and not a PartitioningFactory (which
will infer the schema (and the dictionary values) from the file paths)
- In addition, it's only needed to specify it when reading, and not when
writing with a Partitioning (so can create and use a schema-based Partitioning
object without specifying the `dictionaries`).
This is the "only required when parsing paths" (the docstring says "or an
error will be raised in parsing"), since we don't need to parse paths when
writing. But for a user the "parsing paths" == "reading" (in practice) might
not necessarily be clear.
This behaviour of requiring explicit dictionaries when reading a dataset
with a Partitioning object with a schema including dictionary fields already
exists in 1.0 and 2.0 (only without any way to get around the error "No
dictionary provided for dictionary field", except by letting the partitioning
be discovered instead of specifying a schema). So that's certainly fine for
3.0.0 as well.
But, I am personally still wondering if we can't allow this for reading as
well to have those dictionaries unspecified but discovered, even when
specifying an explicit schema (eg it allows to have mixed dictionary /
non-dictionary partition fields). This actually also worked in pyarrow 0.17.0
(and I added a test about that in the PR fixing it
(https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but that
got apparently lost in a rebase ;)), but I suppose this was changed after
ensuring that the dictionary-typed partition fields "know" the full dictionary
of all possible values the dataset
(https://github.com/apache/arrow/pull/7536#issuecomment-649500017).
I can open a JIRA about this to discuss further.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]