[ https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221635#comment-16221635 ]

ASF GitHub Bot commented on ARROW-1555:
---------------------------------------

wesm closed pull request #1240: ARROW-1555 [Python] Implement Dask exists function
URL: https://github.com/apache/arrow/pull/1240

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py
index 8d2d8fcd3..926df0e30 100644
--- a/python/pyarrow/filesystem.py
+++ b/python/pyarrow/filesystem.py
@@ -135,6 +135,13 @@ def isfile(self, path):
         """
         raise NotImplementedError
 
+    def _isfilestore(self):
+        """
+        Returns True if this FileSystem is a unix-style file store with
+        directories.
+        """
+        raise NotImplementedError
+
     def read_parquet(self, path, columns=None, metadata=None, schema=None,
                      nthreads=1, use_pandas_metadata=False):
         """
@@ -209,6 +216,10 @@ def isdir(self, path):
     def isfile(self, path):
         return os.path.isfile(path)
 
+    @implements(FileSystem._isfilestore)
+    def _isfilestore(self):
+        return True
+
     @implements(FileSystem.exists)
     def exists(self, path):
         return os.path.exists(path)
@@ -247,10 +258,22 @@ def isdir(self, path):
     def isfile(self, path):
         raise NotImplementedError("Unsupported file system API")
 
+    @implements(FileSystem._isfilestore)
+    def _isfilestore(self):
+        """
+        Object Stores like S3 and GCSFS are based on key lookups, not true
+        file-paths
+        """
+        return False
+
     @implements(FileSystem.delete)
     def delete(self, path, recursive=False):
         return self.fs.rm(path, recursive=recursive)
 
+    @implements(FileSystem.exists)
+    def exists(self, path):
+        return self.fs.exists(path)
+
     @implements(FileSystem.mkdir)
     def mkdir(self, path):
         return self.fs.mkdir(path)
diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 0a40f5fb7..9dcc30c8a 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -985,7 +985,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
     else:
         fs = _ensure_filesystem(filesystem)
 
-    if not fs.exists(root_path):
+    if fs._isfilestore() and not fs.exists(root_path):
         fs.mkdir(root_path)
 
     if partition_cols is not None and len(partition_cols) > 0:
@@ -1004,7 +1004,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
             subtable = Table.from_pandas(subgroup,
                                          preserve_index=preserve_index)
             prefix = "/".join([root_path, subdir])
-            if not fs.exists(prefix):
+            if fs._isfilestore() and not fs.exists(prefix):
                 fs.mkdir(prefix)
             outfile = compat.guid() + ".parquet"
             full_path = "/".join([prefix, outfile])
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index 09184cc05..a7fe98ce7 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -978,6 +978,7 @@ def _visit_level(base_dir, level, part_keys):
                 part_table = pa.Table.from_pandas(filtered_df)
                 with fs.open(file_path, 'wb') as f:
                     _write_table(part_table, f)
+                assert fs.exists(file_path)
             else:
                 _visit_level(level_dir, level + 1, this_part_keys)
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] write_to_dataset on s3
> -------------------------------
>
>                 Key: ARROW-1555
>                 URL: https://issues.apache.org/jira/browse/ARROW-1555
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Young-Jun Ko
>            Assignee: Florian Jetter
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> When writing an Arrow table to S3, I get a NotImplementedError.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
>
>     import pyarrow
>     import pyarrow.parquet as pqa
>     import s3fs
>
>     s3 = s3fs.S3FileSystem()
>     pqa._ensure_filesystem(s3).exists("anything")
>
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem
> does not expose the exists method of s3.
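
With the patch applied, the wrapper's exists() delegates to the
underlying s3fs instance instead of raising. A sketch of the same
reproduction after the fix (the s3fs setup mirrors the report):

    import s3fs
    import pyarrow.parquet as pqa

    s3 = s3fs.S3FileSystem()
    fs = pqa._ensure_filesystem(s3)
    # exists() is now implemented as self.fs.exists(path), so this
    # returns a bool instead of raising NotImplementedError.
    print(fs.exists("anything"))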



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
