[jira] [Commented] (ARROW-1805) [Python] ignore non-parquet files when exploring dataset

2017-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257877#comment-16257877
 ] 

ASF GitHub Bot commented on ARROW-1805:
---

wesm closed pull request #1314: ARROW-1805: [Python] Ignore special private 
files when traversing ParquetDataset
URL: https://github.com/apache/arrow/pull/1314
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 9e0749bb3..3023e1771 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -573,7 +573,7 @@ def _visit_level(self, level, base_path, part_keys):
         filtered_files.sort()
         filtered_directories.sort()
 
-        if len(files) > 0 and len(filtered_directories) > 0:
+        if len(filtered_files) > 0 and len(filtered_directories) > 0:
             raise ValueError('Found files in an intermediate '
                              'directory: {0}'.format(base_path))
         elif len(filtered_directories) > 0:
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index 1df80acc0..522815fce 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1027,8 +1027,11 @@ def _visit_level(base_dir, level, part_keys):
                 with fs.open(file_path, 'wb') as f:
                     _write_table(part_table, f)
                 assert fs.exists(file_path)
+
+                _touch(pjoin(level_dir, '_SUCCESS'))
             else:
                 _visit_level(level_dir, level + 1, this_part_keys)
+                _touch(pjoin(level_dir, '_SUCCESS'))
 
     _visit_level(base_dir, 0, [])
 
@@ -1101,6 +1104,11 @@ def _filter_partition(df, part_keys):
     return df[predicate].drop(to_drop, axis=1)
 
 
+def _touch(path):
+    with open(path, 'wb'):
+        pass
+
+
 @parquet
 def test_read_multiple_files(tmpdir):
     import pyarrow.parquet as pq
@@ -1128,8 +1136,7 @@ def test_read_multiple_files(tmpdir):
         paths.append(path)
 
     # Write a _SUCCESS.crc file
-    with open(pjoin(dirpath, '_SUCCESS.crc'), 'wb') as f:
-        f.write(b'0')
+    _touch(pjoin(dirpath, '_SUCCESS.crc'))
 
     def read_multiple_files(paths, columns=None, nthreads=None, **kwargs):
         dataset = pq.ParquetDataset(paths, **kwargs)
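
The one-line change to parquet.py in the diff above can be illustrated with a
small standalone sketch. This is not pyarrow's code: check_level, is_special
and SPECIAL_NAMES are hypothetical names, the special file names are taken
from the issue description, and the '.crc' suffix rule is an assumption based
on the _SUCCESS.crc file used in the test. The point it shows is that the
"files in an intermediate directory" check should look at the filtered file
list, so that marker files alone do not trigger the error.

```python
# Illustrative sketch only; not pyarrow's implementation.
# Names treated as "special" come from the issue description; the
# '.crc' suffix rule is an assumption based on the test in the diff.
SPECIAL_NAMES = {'_metadata', '_common_metadata', '_SUCCESS'}


def is_special(name):
    """Return True for marker/metadata files that carry no row data."""
    return name in SPECIAL_NAMES or name.endswith('.crc')


def check_level(files, directories, base_path):
    """Validate one directory level and return (data_files, directories).

    Raises ValueError only when real data files sit next to
    subdirectories; special files on their own are ignored.
    """
    filtered_files = sorted(f for f in files if not is_special(f))
    filtered_directories = sorted(directories)

    # The fix in the diff: test filtered_files, not the unfiltered listing.
    if len(filtered_files) > 0 and len(filtered_directories) > 0:
        raise ValueError('Found files in an intermediate '
                         'directory: {0}'.format(base_path))
    return filtered_files, filtered_directories
```

With the pre-fix condition, a level holding only a _SUCCESS marker and
partition subdirectories would have raised; with the filtered list it passes
and returns no data files for that level.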


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] ignore non-parquet files when exploring dataset
>
>              Key: ARROW-1805
>              URL: https://issues.apache.org/jira/browse/ARROW-1805
>          Project: Apache Arrow
>       Issue Type: Bug
>       Components: Python
> Affects Versions: 0.7.1
>         Reporter: Manuel Valdés
>         Assignee: Manuel Valdés
>         Priority: Critical
>           Labels: pull-request-available
>          Fix For: 0.8.0
>
> When exploring a ParquetDataset, some files (_metadata, _common_metadata,
> _SUCCESS) should be ignored when determining whether a directory follows a
> valid structure.
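
The behaviour requested in the issue can be sketched end to end with only the
standard library. This is a hedged illustration, not how pyarrow traverses a
dataset: collect_data_files, build_example, touch and IGNORED are hypothetical
names, the ignored set is copied from the issue text, and the key=value
directory layout mirrors the one generated in the test above.

```python
import os
import tempfile

# Assumption: names taken from the issue text; not pyarrow's actual list.
IGNORED = {'_metadata', '_common_metadata', '_SUCCESS'}


def touch(path):
    """Create an empty file at path."""
    with open(path, 'wb'):
        pass


def collect_data_files(base_path):
    """Walk a partitioned tree and return data file paths relative to base_path.

    Ignored marker files never count as data, so they may appear at any level.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(base_path):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        data = sorted(f for f in filenames if f not in IGNORED)
        if data and dirnames:
            raise ValueError('Found files in an intermediate '
                             'directory: {0}'.format(dirpath))
        found.extend(os.path.relpath(os.path.join(dirpath, f), base_path)
                     for f in data)
    return found


def build_example(base_path):
    """Create key=a/ and key=b/ leaves, each with a data file and a marker."""
    for value in ('a', 'b'):
        level_dir = os.path.join(base_path, 'key={0}'.format(value))
        os.mkdir(level_dir)
        touch(os.path.join(level_dir, 'data.parq'))
        touch(os.path.join(level_dir, '_SUCCESS'))
    # A marker in the intermediate (root) directory must not raise.
    touch(os.path.join(base_path, '_SUCCESS'))
```

Calling build_example on a fresh temporary directory and then
collect_data_files on the same path returns only the two data.parq paths; the
_SUCCESS markers at the root and in the leaves are skipped.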



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1805) [Python] ignore non-parquet files when exploring dataset

2017-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257870#comment-16257870
 ] 

ASF GitHub Bot commented on ARROW-1805:
---

wesm commented on issue #1314: ARROW-1805: [Python] Ignore special private 
files when traversing ParquetDataset
URL: https://github.com/apache/arrow/pull/1314#issuecomment-345410541
 
 
   Plasma tests are failing here on macOS. @robertnishihara or @pcmoritz, can 
you take a look?




[jira] [Commented] (ARROW-1805) [Python] ignore non-parquet files when exploring dataset

2017-11-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250043#comment-16250043
 ] 

ASF GitHub Bot commented on ARROW-1805:
---

wesm commented on issue #1314: ARROW-1805: [Python] Ignore special private 
files when traversing ParquetDataset
URL: https://github.com/apache/arrow/pull/1314#issuecomment-344029970
 
 
   Thanks for catching this. Can you add a unit test?






