[ https://issues.apache.org/jira/browse/HADOOP-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011692#comment-18011692 ]
ASF GitHub Bot commented on HADOOP-13230:
-----------------------------------------

liapengpony commented on code in PR #2149:
URL: https://github.com/apache/hadoop/pull/2149#discussion_r2250007813

##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -4086,25 +4175,41 @@ public boolean exists(Path f) throws IOException {
   }

   /**
-   * Override superclass so as to add statistic collection.
+   * Optimized probe for a path referencing a dir.
+   * Even though it is optimized to a single HEAD, applications
+   * should not over-use this method...it is all too common.
    * {@inheritDoc}
    */
   @Override
   @SuppressWarnings("deprecation")
   public boolean isDirectory(Path f) throws IOException {

Review Comment:
   @steveloughran it looks like the change to this function was meant to optimize performance, but I am seeing a performance regression after upgrading Spark from 3.1.2 to 3.5.1, and I traced it to this change.

   When I call spark.read.parquet('s3a://path/to/1.parquet', ..., 's3a://path/to/10000.parquet') before this change, ONLY HEAD requests are sent to build the DataFrame. After this change, LIST requests are sent instead, which is significantly slower because I am reading a large number of Parquet files.

   The docstring "it is optimized to a single HEAD" also confuses me, because StatusProbeEnum.DIRECTORIES is just an alias for StatusProbeEnum.LIST_ONLY. Am I missing anything here?
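To make the question in the review comment concrete, here is a minimal toy model in Java. It is not the Hadoop source: the class `ProbeModel`, the enum `Probe`, the set `HEAD_ONLY` and the method `requestsFor` are hypothetical names used only for illustration. The one thing taken from the comment is the claim that the directory probe set is an alias for the list-only set, which would mean a directory check on each path issues a LIST and no HEAD.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class ProbeModel {
    /** The two S3 request kinds discussed in the comment. */
    public enum Probe { HEAD, LIST }

    // Hypothetical stand-ins for the probe sets; not the real StatusProbeEnum.
    public static final Set<Probe> HEAD_ONLY = EnumSet.of(Probe.HEAD);
    public static final Set<Probe> LIST_ONLY = EnumSet.of(Probe.LIST);
    // Assumption taken from the comment: DIRECTORIES is just an alias for LIST_ONLY.
    public static final Set<Probe> DIRECTORIES = LIST_ONLY;

    /**
     * Number of requests of the given kind issued when every path
     * is probed exactly once with the given probe set.
     */
    public static int requestsFor(List<String> paths, Set<Probe> probes, Probe kind) {
        return probes.contains(kind) ? paths.size() : 0;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("s3a://path/to/1.parquet", "s3a://path/to/2.parquet");
        System.out.println("LIST requests: " + requestsFor(paths, DIRECTORIES, Probe.LIST));
        System.out.println("HEAD requests: " + requestsFor(paths, DIRECTORIES, Probe.HEAD));
    }
}
```

Under this assumption, a directory probe over N paths costs N LIST requests and zero HEAD requests, which matches the behaviour the comment reports. Whether the real S3A code path behaves this way is exactly what the comment asks the author to confirm.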

> S3A to optionally retain directory markers
> ------------------------------------------
>
>                 Key: HADOOP-13230
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13230
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Aaron Fabbri
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.1
>
>         Attachments: 2020-02-Fixing the S3A directory marker problem.pdf
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Users of s3a may not realize that, in some cases, it does not interoperate
> well with other s3 tools, such as the AWS CLI. (See HIVE-13778, IMPALA-3558).
> Specifically, if a user:
> - Creates an empty directory with hadoop fs -mkdir s3a://bucket/path
> - Copies data into that directory via another tool, e.g. aws cli.
> - Tries to access the data in that directory with any Hadoop software.
> Then the last step fails because the fake empty directory blob that s3a wrote
> in the first step causes s3a (listStatus() etc.) to continue to treat that
> directory as empty, even though the second step was supposed to populate the
> directory with data.
> I wanted to document this fact for users. We may mark this as not-fix, "by
> design". It may also be interesting to brainstorm solutions and/or a config
> option to change the behavior if folks care.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)