amol- commented on a change in pull request #11008: URL: https://github.com/apache/arrow/pull/11008#discussion_r697272484
########## File path: python/pyarrow/_dataset.pyx ##########

@@ -1998,6 +1998,41 @@ cdef class PartitioningFactory(_Weakrefable):

     cdef inline shared_ptr[CPartitioningFactory] unwrap(self):
         return self.wrapped

+    @property
+    def type_name(self):
+        return frombytes(self.factory.type_name())
+
+    def create_with_schema(self, schema):

Review comment:

I agree that factories look specifically designed for reading, and the fact that `ds.partitioning()` returns a factory makes it harder for a user to deal with writing. The multi-step process you described mostly doesn't exist in `pyarrow`: you invoke `write_dataset` or `read_dataset` and provide the partitioning that should be used for reading/writing. And while `read_dataset` deals with factories, `write_dataset` didn't. At the moment, making `write_dataset` able to deal with factories where possible seemed to be the most reasonable solution: it requires no changes to our API and is convenient for users.

The problem, imho, comes from the fact that building a partitioning requires a `schema`, but `ds.partitioning` allows you to omit it, in which case it gives you back a factory instead. I think that makes our API far more confusing and harder to use. I feel `ds.partitioning` should just have complained when it was unable to build a partitioning, and we should have had a dedicated "partitioning detector" or similar entity for when you want to discover the partitioning from disk. In trying to make the API _more convenient_, it seems the final result is actually more confusing.
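To illustrate the asymmetry being discussed, here is a minimal sketch (the table and output paths are hypothetical; `create_with_schema` is the method added in this diff, so its usage is shown commented out as proposed API rather than something available today):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

# With an explicit schema, ds.partitioning() returns a concrete Partitioning.
part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

# Without one, it returns a PartitioningFactory instead.
factory = ds.partitioning(flavor="hive")

# A concrete Partitioning can be passed to write_dataset today.
ds.write_dataset(table, "/tmp/out", format="parquet", partitioning=part)

# A factory cannot; under this PR, write_dataset could resolve it itself,
# e.g. through the create_with_schema() method added above (proposed API):
# resolved = factory.create_with_schema(pa.schema([("year", pa.int64())]))
# ds.write_dataset(table, "/tmp/out2", format="parquet", partitioning=resolved)
```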