[ 
https://issues.apache.org/jira/browse/ARROW-15280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538090#comment-17538090
 ] 

Joris Van den Bossche commented on ARROW-15280:
-----------------------------------------------

bq. The code in pyarrow where this is mentioned looks to be in the legacy 
ParquetDataset code path, which predates the Dataset API we use in R. I believe 
this is being deprecated (right Joris Van den Bossche?). So it's not a drop-in 
fix for R.

Yes, that's correct (so this is actually also an issue for Python).

bq. Perhaps we should add an ignore_suffixes option, or some other interface 
for filtering like this (cc Ben Kietzman)

I was thinking exactly the same. Based on the Python snippet from the legacy 
code, it is clear that it ignores some files based on both prefixes and 
suffixes.

It might be possible to have a more advanced / smarter (callback-based?) 
filename filter option, but I suppose that simpler prefix and suffix ignore 
options would cover almost all use cases?
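
To make the prefix+suffix idea concrete, here is a minimal Python sketch of 
such a filter. The function name and the ignore_suffixes parameter are 
hypothetical (this is not an existing Arrow API), and the default values are 
just the patterns from the legacy snippet quoted in the issue description; it 
only looks at the base name of each path, like the legacy code does.

```python
import posixpath


def should_ignore(path, ignore_prefixes=(".", "_"),
                  ignore_suffixes=(".crc", "_$folder$")):
    """Return True if the file at `path` should be skipped during discovery.

    Hypothetical helper: only the base name is inspected, so a file inside
    a directory whose name starts with "_" is NOT ignored by this check.
    """
    name = posixpath.basename(path)
    # str.startswith / str.endswith accept a tuple of candidates.
    return (name.startswith(tuple(ignore_prefixes))
            or name.endswith(tuple(ignore_suffixes)))


paths = [
    "bucket/data/part-0.parquet",
    "bucket/data/.part-0.parquet.crc",
    "bucket/data/_SUCCESS",
    "bucket/data_$folder$",
]
kept = [p for p in paths if not should_ignore(p)]
```

A callback-based option would generalize this by letting the user pass their 
own predicate in place of should_ignore, at the cost of a more complex 
interface across the language bindings.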

> [R] Expose FileSystemFactoryOptions
> -----------------------------------
>
>                 Key: ARROW-15280
>                 URL: https://issues.apache.org/jira/browse/ARROW-15280
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Bob Rudis
>            Assignee: Neal Richardson
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-4406 notes that:
> >Currently reading parquet files generated by Hadoop (EMR) from S3 fails with 
> >"ValueError: Found files in an intermediate directory" because of the 
> >_$folder$ empty files.
> This was fixed in pyarrow, but R still has this issue.
> The R side does not seem to have something similar to:
> {{  def _should_silently_exclude(self, file_name):}}
> {{    return (file_name.endswith('.crc') or # Checksums}}
> {{            file_name.endswith('_$folder$') or # HDFS directories in S3}}
> {{            file_name.startswith('.') or # Hidden files starting with .}}
> {{            file_name.startswith('_') or # Hidden files starting with _}}
> {{            file_name in EXCLUDED_PARQUET_PATHS)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
