[ https://issues.apache.org/jira/browse/ARROW-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260724#comment-16260724 ]

ASF GitHub Bot commented on ARROW-1830:
---------------------------------------

xhochy closed pull request #1340: ARROW-1830: [Python] Relax restriction that 
Parquet files in a dataset end in .parq or .parquet
URL: https://github.com/apache/arrow/pull/1340
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 3023e1771..37da66280 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -421,10 +421,6 @@ def read(self, columns=None, nthreads=1, partitions=None,
         return table
 
 
-def _is_parquet_file(path):
-    return path.endswith('parq') or path.endswith('parquet')
-
-
 class PartitionSet(object):
     """A data structure for cataloguing the observed Parquet partitions at a
     particular level. So if we have
@@ -556,14 +552,14 @@ def _visit_level(self, level, base_path, part_keys):
         filtered_files = []
         for path in files:
             full_path = self.pathsep.join((base_path, path))
-            if _is_parquet_file(path):
-                filtered_files.append(full_path)
-            elif path.endswith('_common_metadata'):
+            if path.endswith('_common_metadata'):
                 self.common_metadata_path = full_path
             elif path.endswith('_metadata'):
                 self.metadata_path = full_path
-            elif not self._should_silently_exclude(path):
+            elif self._should_silently_exclude(path):
                 print('Ignoring path: {0}'.format(full_path))
+            else:
+                filtered_files.append(full_path)
 
         # ARROW-1079: Filter out "private" directories starting with underscore
         filtered_directories = [self.pathsep.join((base_path, x))
diff --git a/python/pyarrow/tests/test_parquet.py 
b/python/pyarrow/tests/test_parquet.py
index 522815fce..274ff458f 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1020,7 +1020,7 @@ def _visit_level(base_dir, level, part_keys):
 
             if level == DEPTH - 1:
                 # Generate example data
-                file_path = pjoin(level_dir, 'data.parq')
+                file_path = pjoin(level_dir, guid())
 
                 filtered_df = _filter_partition(df, this_part_keys)
                 part_table = pa.Table.from_pandas(filtered_df)
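In effect, the patch inverts the filter: instead of keeping only paths ending in .parq or .parquet, it keeps every file that is not a metadata sidecar and is not explicitly excluded. A minimal standalone sketch of that logic follows; the function name filter_files, the should_exclude callback, and the pathsep default are illustrative stand-ins, not pyarrow API:

```python
def filter_files(base_path, files, should_exclude, pathsep='/'):
    """Partition a directory listing the way the patched _visit_level does:
    metadata sidecar files are recorded, excluded names are reported, and
    every remaining file is kept as data -- with no extension check."""
    common_metadata_path = None
    metadata_path = None
    filtered_files = []
    for path in files:
        full_path = pathsep.join((base_path, path))
        # Check '_common_metadata' before '_metadata': the former also
        # ends with '_metadata', so the order of the branches matters.
        if path.endswith('_common_metadata'):
            common_metadata_path = full_path
        elif path.endswith('_metadata'):
            metadata_path = full_path
        elif should_exclude(path):
            print('Ignoring path: {0}'.format(full_path))
        else:
            filtered_files.append(full_path)
    return filtered_files, metadata_path, common_metadata_path
```

With this ordering, a Spark-style output file such as part-00000-<uuid> is collected as a data file even though it carries no .parquet suffix, which is exactly the case reported in the issue quoted further down.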


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Error when loading all the files in a directory
> ---------------------------------------------------------
>
>                 Key: ARROW-1830
>                 URL: https://issues.apache.org/jira/browse/ARROW-1830
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>         Environment: Python 2.7.11 (default, Jan 22 2016, 08:29:18)  + 
> pyarrow 0.7.1
>            Reporter: DB Tsai
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> I can read one parquet file, but when I tried to read all the parquet files 
> in a folder, I got an error.
> {code:java}
> >>> data = pq.ParquetDataset('./aaa/part-00000-d8268e3a-4e65-41a3-a43e-01e0bf68ee86')
> >>> data = pq.ParquetDataset('./aaa/')
> Ignoring path: ./aaa//part-00000-d8268e3a-4e65-41a3-a43e-01e0bf68ee86
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.7/site-packages/pyarrow/parquet.py", line 638, in __init__
>     self.validate_schemas()
>   File "/usr/local/lib/python2.7/site-packages/pyarrow/parquet.py", line 647, in validate_schemas
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> IndexError: list index out of range
> >>> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
