[GitHub] [arrow] vibhatha commented on a change in pull request #11911: ARROW-15019: [Python] Add bindings for new dataset writing options

GitBox Wed, 15 Dec 2021 19:10:18 -0800


vibhatha commented on a change in pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#discussion_r770190430




##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3621,6 +3621,204 @@ def compare_tables_ignoring_order(t1, t2):
     assert not extra_file.exists()
 
 
+def _generate_random_int_array(size=4, min=1, max=10):
+    return np.random.randint(min, max, size)
+
+
+def _generate_data_and_columns(num_of_columns, records_per_row,
+                               unique_records=None):
+    data = []
+    column_names = []
+    if unique_records is None:
+        unique_records = records_per_row
+    for i in range(num_of_columns):
+        data.append(_generate_random_int_array(size=records_per_row,
+                                               min=1,
+                                               max=unique_records))
+        column_names.append("c" + str(i))
+    return data, column_names
+
+
+def _get_num_of_files_generated(base_directory):
+    file_dirs = os.listdir(base_directory)
+    number_of_files = 0
+    for _, file_dir in enumerate(file_dirs):

Review comment:
       `os.walk` is a nice suggestion. Looking at @westonpace's comment on 
the same part: since we are only doing non-nested partitioning, 
`list(pathlib.Path(base_directory).glob('**/*.parquet'))` seems simpler 
and fits this case. What do you think? 
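
For comparison, here is a minimal sketch of the two approaches side by side. The helper names `count_parquet_files_walk` and `count_parquet_files_glob` and the temporary directory layout are hypothetical and only for illustration; they are not the code in the PR.

```python
import os
import pathlib
import tempfile


def count_parquet_files_walk(base_directory):
    """Count .parquet files under base_directory using os.walk."""
    count = 0
    for _root, _dirs, files in os.walk(base_directory):
        count += sum(1 for name in files if name.endswith(".parquet"))
    return count


def count_parquet_files_glob(base_directory):
    """Count .parquet files under base_directory using a recursive glob."""
    return len(list(pathlib.Path(base_directory).glob("**/*.parquet")))


# Build a small, made-up single-level partitioned layout (assumption:
# one directory per partition value, one parquet file in each).
with tempfile.TemporaryDirectory() as tmp:
    for part in ("c0=1", "c0=2"):
        part_dir = pathlib.Path(tmp) / part
        part_dir.mkdir()
        (part_dir / "part-0.parquet").touch()
    # A non-parquet file that both helpers should ignore.
    (pathlib.Path(tmp) / "c0=1" / "notes.txt").touch()

    walk_count = count_parquet_files_walk(tmp)
    glob_count = count_parquet_files_glob(tmp)
```

Both helpers should agree on this layout. Because `**` recurses into subdirectories, the glob version would also cover nested partitioning if the test later needs it.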




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
