[ https://issues.apache.org/jira/browse/SPARK-53075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-53075:
----------------------------------
    Description:
Java implementations are faster.

**SAMPLE DATA**
{code}
scala> val array = new java.util.ArrayList[String]()
val array: java.util.ArrayList[String] = []
scala> (1 to 100_000_000).foreach { _ => array.add("a") }
{code}

**BEFORE (WRITE)**
{code}
scala> spark.time(org.apache.commons.io.FileUtils.writeLines(new java.io.File("/tmp/text"), array))
Time taken: 5013 ms
{code}

**AFTER (WRITE)**
{code}
scala> spark.time(java.nio.file.Files.write(java.nio.file.Paths.get("/tmp/text"), array))
Time taken: 1191 ms
{code}

**BEFORE (READ)**
{code}
scala> spark.time(org.apache.commons.io.FileUtils.readLines(new java.io.File("/tmp/text")))
Time taken: 2377 ms
{code}

**AFTER (READ)**
{code}
scala> spark.time(java.nio.file.Files.readAllLines(java.nio.file.Paths.get("/tmp/text")))
Time taken: 2279 ms
{code}

> Use Java `Files.readAllLines/write` instead of `FileUtils.(read|write)Lines`
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-53075
>                 URL: https://issues.apache.org/jira/browse/SPARK-53075
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.1.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
>
>
> Java implementations are faster.
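For reference, the replacement benchmarked in the description can be sketched as a small round trip on a temporary file. This is only an illustration, not the actual Spark patch: the helper names `writeLinesNio` and `readLinesNio` and the temp-file prefix are hypothetical, and the charset is passed explicitly on the assumption that callers want UTF-8 regardless of the JVM default.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.util.{List => JList}

// Hypothetical helpers for illustration; these names do not come from Spark.
def writeLinesNio(path: Path, lines: JList[String]): Path = {
  // Writes each element followed by the platform line separator.
  Files.write(path, lines, StandardCharsets.UTF_8)
}

def readLinesNio(path: Path): JList[String] = {
  // Reads the whole file into memory, one list element per line.
  Files.readAllLines(path, StandardCharsets.UTF_8)
}

// Round trip on a temp file (a tiny input, unlike the 100M-line benchmark).
val tmp = Files.createTempFile("spark-53075-", ".txt")
val input = java.util.Arrays.asList("a", "b", "c")
writeLinesNio(tmp, input)
val output = readLinesNio(tmp)
Files.delete(tmp)
```

Both calls hold the entire line list in memory, as the `FileUtils` variants do, so the sketch covers the same use case as the timings above and is not meant for streaming very large files.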