amol- commented on a change in pull request #11008: URL: https://github.com/apache/arrow/pull/11008#discussion_r697272484
########## File path: python/pyarrow/_dataset.pyx ##########

@@ -1998,6 +1998,41 @@ cdef class PartitioningFactory(_Weakrefable):

     cdef inline shared_ptr[CPartitioningFactory] unwrap(self):
         return self.wrapped

+    @property
+    def type_name(self):
+        return frombytes(self.factory.type_name())
+
+    def create_with_schema(self, schema):

Review comment:

I agree that factories look specifically designed for reading, and the fact that `ds.partitioning()` returns a factory makes it harder for a user to deal with writing. The multi-step process you described mostly doesn't exist in `pyarrow`: you invoke `write_dataset` or `read_dataset` and provide the partitioning that should be used for reading/writing. And while `read_dataset` deals with factories, `write_dataset` didn't. At the moment, making `write_dataset` able to deal with factories where possible seemed to be the most reasonable solution: it requires no changes to our API and is convenient for users.

The problem, imho, comes from the fact that building a partitioning requires a `schema`, but `ds.partitioning` allows you to omit it, in which case it gives you back a factory instead. I think that makes our API far more confusing and harder to use. I feel `ds.partitioning` should just have complained when it was unable to build a partitioning, and we should have had a dedicated "partitioning detector" or similar entity for when you want to discover the partitioning from disk. In trying to make the API _more convenient_, it seems the final result is actually more confusing.
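To illustrate the asymmetry being discussed, here is a minimal sketch (the table and output paths are hypothetical; `create_with_schema` is the method added in this diff, so its usage is shown commented out as proposed API rather than something available today):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

# With an explicit schema, ds.partitioning() returns a concrete Partitioning.
part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

# Without one, it returns a PartitioningFactory instead.
factory = ds.partitioning(flavor="hive")

# A concrete Partitioning can be passed to write_dataset today.
ds.write_dataset(table, "/tmp/out", format="parquet", partitioning=part)

# A factory cannot; under this PR, write_dataset could resolve it itself,
# e.g. through the create_with_schema() method added above (proposed API):
# resolved = factory.create_with_schema(pa.schema([("year", pa.int64())]))
# ds.write_dataset(table, "/tmp/out2", format="parquet", partitioning=resolved)
```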