dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876991565
To @cloud-fan , I tested as follows at 100k scale. I don't see the
**10x** regression you mentioned. Given this result, could you share
your exact procedure? Otherwise, I don't think Apache Spark has an issue here.
1. Prepare **100k files** in a single directory
```bash
$ mkdir /tmp/100k
$ for i in {1..100000}; do touch /tmp/100k/$i.txt; done
$ aws s3 sync /tmp/100k s3://dongjoon/100k
$ aws s3 ls s3://dongjoon/100k/ --summarize | tail -n2
Total Objects: 100000
Total Size: 0
```
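As a side note, looping `touch` 100k times is the slow part of step 1; a batched sketch with the same result (assuming GNU `seq`, `sed`, and `xargs` are available) is:

```bash
# Create 100000 empty files in /tmp/100k using a few xargs batches
# instead of one touch invocation per file.
mkdir -p /tmp/100k
seq 1 100000 | sed 's|^|/tmp/100k/|; s|$|.txt|' | xargs touch
ls /tmp/100k | wc -l   # expect 100000
```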
2. Build Apache Spark 4
```bash
$ NO_MANUAL=1 ./dev/make-distribution.sh -Phadoop-cloud,hive
```
3. Comparison
**Apache Spark 3.5.0**
```
$ bin/spark-shell \
    -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
24/01/04 04:02:36 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
24/01/04 04:02:37 WARN Utils: Service 'SparkUI' could not bind on port 4040.
Attempting port 4041.
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id =
local-1704369757182).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
24/01/04 04:02:45 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
Time taken: 12310 ms
res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
```
**Apache Spark 4.0.0**
```
$ bin/spark-shell \
    -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT
/_/
Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.
24/01/04 04:06:26 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id =
local-1704369986620).
Spark session available as 'spark'.
scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
24/01/04 04:06:29 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/01/04 04:06:29 WARN SDKV2Upgrade: Directly referencing AWS SDK V1
credential provider com.amazonaws.auth.profile.ProfileCredentialsProvider. AWS
SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
Time taken: 12857 ms
val res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
```
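For the record, the gap between the two `Time taken` values above is under 5%, nowhere near 10x; a quick awk check:

```bash
# Relative slowdown of the Spark 4.0.0-SNAPSHOT run vs. the 3.5.0 run,
# using the two "Time taken" values reported above (in milliseconds).
awk 'BEGIN {
  t35 = 12310   # Spark 3.5.0
  t40 = 12857   # Spark 4.0.0-SNAPSHOT
  printf "slowdown: %.1f%%\n", (t40 - t35) / t35 * 100   # prints "slowdown: 4.4%"
}'
```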