Bob Rudis created ARROW-15280:
---------------------------------
Summary: Ignore "*_$folder$" files on S3 in {arrow} R package
Key: ARROW-15280
URL: https://issues.apache.org/jira/browse/ARROW-15280
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 6.0.1
Reporter: Bob Rudis
ARROW-4406 notes that:
>Currently reading parquet files generated by Hadoop (EMR) from S3 fails with
>"ValueError: Found files in an intermediate directory" because of the
>_$folder$ empty files.
This was fixed in pyarrow, but the R package still has this issue.
The R side does not seem to have anything similar to:
{{def _should_silently_exclude(self, file_name):}}
{{    return (file_name.endswith('.crc') or  # Checksums}}
{{            file_name.endswith('_$folder$') or  # HDFS directories in S3}}
{{            file_name.startswith('.') or  # Hidden files starting with .}}
{{            file_name.startswith('_') or  # Hidden files starting with _}}
{{            file_name in EXCLUDED_PARQUET_PATHS)}}
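To illustrate the behaviour being requested, here is a minimal standalone sketch of the same exclusion rule applied to a list of S3 keys. It is not the pyarrow implementation and not a proposed R patch. The names EXCLUDED_PATHS and filter_data_files, the '_SUCCESS' entry, and the sample keys are all assumptions made up for the example.

```python
import posixpath

# Assumption: stand-in for pyarrow's EXCLUDED_PARQUET_PATHS constant.
EXCLUDED_PATHS = {"_SUCCESS"}


def should_silently_exclude(file_name):
    """Return True for marker or hidden files that are not data files."""
    return (
        file_name.endswith(".crc")            # Checksums
        or file_name.endswith("_$folder$")    # HDFS directory markers in S3
        or file_name.startswith(".")          # Hidden files starting with .
        or file_name.startswith("_")          # Hidden files starting with _
        or file_name in EXCLUDED_PATHS
    )


def filter_data_files(keys):
    """Keep only the keys whose base name passes the exclusion check."""
    return [
        key for key in keys
        if not should_silently_exclude(posixpath.basename(key))
    ]


# Hypothetical listing of an EMR-written dataset prefix.
keys = [
    "warehouse/events_$folder$",
    "warehouse/events/year=2021_$folder$",
    "warehouse/events/year=2021/part-00000.parquet",
    "warehouse/events/year=2021/.part-00000.parquet.crc",
    "warehouse/events/_SUCCESS",
]
print(filter_data_files(keys))
```

With the sample keys above, only the part-00000.parquet key survives. The check is applied to the base name of each key rather than the full path, so a prefix that happens to start with '_' or '.' higher up in the path does not exclude everything beneath it.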
--
This message was sent by Atlassian Jira
(v8.20.1#820001)