[ https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591615#comment-14591615 ]
Ewan Leith commented on SPARK-8437:
-----------------------------------

Thanks, I wasn't sure whether it was Hadoop- or Spark-specific; initially I thought it was S3-related, but it happens everywhere. If it is a Hadoop issue, I don't know whether it would be feasible for Spark to check whether a directory has been given and add a wildcard in the background; that might not give the desired effect. Otherwise, there are various doc changes to make.

> Using directory path without wildcard for filename slow for large number of
> files with wholeTextFiles and binaryFiles
> ---------------------------------------------------------------------------------------------------------------------
>
>                  Key: SPARK-8437
>                  URL: https://issues.apache.org/jira/browse/SPARK-8437
>              Project: Spark
>           Issue Type: Bug
>           Components: Input/Output
>     Affects Versions: 1.3.1, 1.4.0
>          Environment: Ubuntu 15.04 + local filesystem
>                       Amazon EMR + S3 + HDFS
>             Reporter: Ewan Leith
>             Priority: Minor
>
> When calling wholeTextFiles or binaryFiles with a directory path containing
> tens of thousands of files, Spark hangs for a few minutes before processing
> the files. If you add a * to the end of the path, there is no delay.
>
> This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS,
> and on S3.
>
> To reproduce, create a directory with 50,000 files in it, then run:
>
>   val a = sc.binaryFiles("file:/path/to/files/")
>   a.count()
>   val b = sc.binaryFiles("file:/path/to/files/*")
>   b.count()
>
> and monitor the different startup times.
> For example, in the spark-shell these commands are pasted in together, so the
> delay at f.count() runs from 10:11:08 to 10:13:29 before the output "Total
> input paths to process : 49999" appears, then until 10:15:42 before it begins
> processing files:
>
> scala> val f = sc.binaryFiles("file:/home/ewan/large/")
> 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
> 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with curMem=160616, maxMem=278019440
> 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
> f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ BinaryFileRDD[0] at binaryFiles at <console>:21
>
> scala> f.count()
> 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions (allowLocal=false)
> 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24)
> 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
>
> Adding a * to the end of the path removes the delay:
>
> scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with curMem=160616, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
> f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* BinaryFileRDD[0] at binaryFiles at <console>:21
>
> scala> f.count()
> 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:08:35 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
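The comment above floats the idea of Spark appending a wildcard in the background when a bare directory is passed. As a user-side workaround in the meantime, that idea could be sketched as a small helper; this is a hypothetical illustration (the name `withTrailingGlob` and the glob-detection heuristic are assumptions, not part of Spark or Hadoop):

```scala
// Hypothetical helper: append a "*" glob when the path looks like a bare
// directory, so sc.binaryFiles(withTrailingGlob(path)) takes the fast
// glob-listing route described in this issue. Paths that already contain
// glob characters are left untouched.
object GlobWorkaround {
  private val globChars = Set('*', '?', '[', '{')

  def withTrailingGlob(path: String): String =
    if (path.exists(globChars.contains)) path // already a glob pattern
    else if (path.endsWith("/")) path + "*"   // "dir/" -> "dir/*"
    else path + "/*"                          // "dir"  -> "dir/*"

  def main(args: Array[String]): Unit = {
    assert(withTrailingGlob("file:/home/ewan/large/") == "file:/home/ewan/large/*")
    assert(withTrailingGlob("file:/home/ewan/large") == "file:/home/ewan/large/*")
    assert(withTrailingGlob("file:/home/ewan/large/*") == "file:/home/ewan/large/*")
    println("ok")
  }
}
```

As noted in the comment, doing this automatically inside Spark might not always give the desired effect (for example, a trailing glob changes which entries are matched when the directory contains subdirectories), which is why a doc change may be the safer fix.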