normalscene commented on pull request #5708: URL: https://github.com/apache/spark/pull/5708#issuecomment-902743967
Hello @yongtang, I am troubleshooting slow wholeTextFiles reads, and while digging through the function's code I came across this pull request. I have a question about the note [here](https://github.com/apache/spark/blob/90cbf9ca3ed1f7f3271c3c8b592f22c5a5df2eee/core/src/main/scala/org/apache/spark/SparkContext.scala#L954). As I understand it, the note describes how one could speed up wholeTextFiles() reads from multiple paths, but I can't work out what "path/*" means there. Any pointers would be much appreciated.

In our case, we read from 24 paths at once by passing a comma-separated list of paths in our gs bucket URL, and it took about 40 minutes to read the data and prepare the DataFrame, which we feel is quite slow. When the number of paths is reduced to 10, it takes about one fourth of the time.

My apologies for posting a question to you here; I couldn't find a way to ask questions on the Spark GitHub repo. If there is an official channel for questions, please point me to it and I will follow the proper process to get help from the community.
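
For reference, here is a minimal sketch of the two path forms in question, assuming PySpark. In Hadoop-style paths, a trailing `/*` is a glob that matches every entry directly inside a directory, so `path/*` names the files themselves instead of the directory. The bucket and directory names are hypothetical, the helper functions are illustrative only, and whether the glob form is actually faster on a given filesystem is not something this sketch demonstrates.

```python
from typing import List


def comma_separated(paths: List[str]) -> str:
    """Join several input directories into the single comma-separated
    string that wholeTextFiles accepts as its path argument."""
    return ",".join(p.rstrip("/") for p in paths)


def with_glob(paths: List[str]) -> str:
    """Same as comma_separated, but append '/*' to each directory so the
    argument names the entries inside it rather than the directory itself."""
    return ",".join(p.rstrip("/") + "/*" for p in paths)


# Hypothetical bucket and directory names, for illustration only.
dirs = ["gs://example-bucket/data/day=01/", "gs://example-bucket/data/day=02"]

plain = comma_separated(dirs)
globbed = with_glob(dirs)
print(plain)
print(globbed)


def read_all(sc, path_arg: str):
    """Pass either string to Spark. `sc` is an existing SparkContext;
    the result is an RDD of (file path, file content) pairs."""
    return sc.wholeTextFiles(path_arg)
```

Calling `read_all(sc, plain)` and `read_all(sc, globbed)` on the same inputs and comparing wall-clock time would be one way to check whether the glob form makes a difference for a particular bucket.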