dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876991565
To @cloud-fan , I tested as follows at 100k scale. I don't see the
**10x** regression you mentioned. Given this result, could you share
your exact procedure? Otherwise, I don't think Apache Spark has an issue here.
1. Prepare **100k files** in a single directory
```bash
$ mkdir /tmp/100k
$ for i in {1..100000}; do touch /tmp/100k/$i.txt; done
$ aws s3 sync /tmp/100k s3://dongjoon/100k
$ aws s3 ls s3://dongjoon/100k/ --summarize | tail -n2
Total Objects: 100000
Total Size: 0
```
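As a side note, looping `touch` 100k times is the slow part of step 1; a batched sketch with the same result (assuming GNU `seq`, `sed`, and `xargs` are available) is:

```bash
# Create 100000 empty files in /tmp/100k using a few xargs batches
# instead of one touch invocation per file.
mkdir -p /tmp/100k
seq 1 100000 | sed 's|^|/tmp/100k/|; s|$|.txt|' | xargs touch
ls /tmp/100k | wc -l   # expect 100000
```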
2. Build Apache Spark 4
```bash
$ NO_MANUAL=1 ./dev/make-distribution.sh -Phadoop-cloud,hive
```
3. Comparison
**Apache Spark 3.5.0**
```
$ bin/spark-shell \
    -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
24/01/04 04:02:36 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
24/01/04 04:02:37 WARN Utils: Service 'SparkUI' could not bind on port 4040.
Attempting port 4041.
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id =
local-1704369757182).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
24/01/04 04:02:45 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
Time taken: 12310 ms
res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
```
**Apache Spark 4.0.0**
```
$ bin/spark-shell \
    -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT
/_/
Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.
24/01/04 04:06:26 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id =
local-1704369986620).
Spark session available as 'spark'.
scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
24/01/04 04:06:29 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/01/04 04:06:29 WARN SDKV2Upgrade: Directly referencing AWS SDK V1
credential provider com.amazonaws.auth.profile.ProfileCredentialsProvider. AWS
SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
Time taken: 12857 ms
val res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
```
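For the record, the gap between the two `Time taken` values above is under 5%, nowhere near 10x; a quick awk check:

```bash
# Relative slowdown of the Spark 4.0.0-SNAPSHOT run vs. the 3.5.0 run,
# using the two "Time taken" values reported above (in milliseconds).
awk 'BEGIN {
  t35 = 12310   # Spark 3.5.0
  t40 = 12857   # Spark 4.0.0-SNAPSHOT
  printf "slowdown: %.1f%%\n", (t40 - t35) / t35 * 100   # prints "slowdown: 4.4%"
}'
```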