[GitHub] [spark] normalscene commented on pull request #5708: [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input

GitBox Fri, 20 Aug 2021 07:44:18 -0700


normalscene commented on pull request #5708:
URL: https://github.com/apache/spark/pull/5708#issuecomment-902743967

Hello @yongtang, I am troubleshooting a slow wholeTextFiles read, and while
digging through the function's code I came across this ticket number.

I have a question about the note
[here](https://github.com/apache/spark/blob/90cbf9ca3ed1f7f3271c3c8b592f22c5a5df2eee/core/src/main/scala/org/apache/spark/SparkContext.scala#L954).
As I read it, the note describes how one could speed up reading with
wholeTextFiles() from multiple paths. I can't understand what is meant by
"path/&#42;" in that note. Any pointers would be much appreciated.

In our case, we read from 24 paths at once by passing a comma-separated list
of paths in our gs bucket URL, and it took about 40 minutes to read the data
and prepare the DataFrame, which we feel is quite slow. When the number of
paths is reduced to 10, it takes about one fourth of the time.
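
For context, here is a minimal sketch of the kind of call described above. It is an illustration only: the bucket name, the one-directory-per-hour layout, and the helper names `comma_separated_paths` and `read_whole_files` are placeholders and assumptions, not our real code or paths.

```python
from typing import Iterable


def comma_separated_paths(bucket: str, directories: Iterable[str]) -> str:
    """Build the single comma-separated path string passed to wholeTextFiles."""
    return ",".join(f"gs://{bucket}/{d.strip('/')}" for d in directories)


def read_whole_files(sc, path_list: str):
    """Read every file under the listed paths as (file path, file content) pairs.

    `sc` is assumed to be an existing pyspark SparkContext.
    """
    return sc.wholeTextFiles(path_list)


# Hypothetical layout: one directory per hour, 24 in total.
HOURLY_DIRS = [f"logs/hour={h:02d}" for h in range(24)]
PATH_LIST = comma_separated_paths("example-bucket", HOURLY_DIRS)
```

Calling `read_whole_files(sc, PATH_LIST)` with a live SparkContext would then issue one read over all 24 directories, which is the pattern that is slow for us.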

My apologies for posting this question to you here; I couldn't find a way to
ask questions on the Spark GitHub repo. If there is an official channel for
questions, please point me to it and I will follow the proper process and get
help from the community.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
