Hi all,

I have the following use case:
One job consists of reading 500-2000 small bzipped logs that sit on an NFS
share.
(Small means the compressed logs are between 0 and 100 KB; the average file
size is about 20 KB.)

We read the log lines, apply some transformations, and write the results to a
single output file.

When we do it in pure Python (running the script on one core):
-the time for 500 bzipped log files (6.5 MB altogether) is about 5 seconds.
-the time for 2000 bzipped log files (25 MB altogether) is about 20 seconds.
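
For reference, a minimal sketch of what that single-core Python baseline looks
like (the NFS path, the output file name, and the transform itself are
placeholders here, not our actual code):

import bz2
import glob

def transform(line):
    # Placeholder for the real per-line transformation
    return line.upper()

# Hypothetical NFS mount point; in reality each job gets its own list of
# 500-2000 bzipped log files
input_files = sorted(glob.glob("/mnt/nfs/logs/*.bz2"))

with open("transformed.log", "w") as out:
    for path in input_files:
        # bz2.open in "rt" mode decompresses and yields text lines
        with bz2.open(path, "rt") as f:
            for line in f:
                out.write(transform(line))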

Because there will be many such jobs, I was thinking of trying Spark for
this purpose.
Here are my preliminary findings and questions:

*Even just counting the number of log lines with Spark is about 10 times
slower than the entire transformation done by the Python script.
*sc.textFile(list_of_filenames) appears to perform poorly on small files.
Why is that?
*sc.wholeTextFiles(path_to_files) performs better than sc.textFile, but does
not support bzipped files. However, even wholeTextFiles does not come close
to the speed of the Python script. (A sketch of both variants is below.)
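
For concreteness, here is roughly how the two Spark variants are invoked
(the paths and the app name are hypothetical; the line splitting in the
wholeTextFiles variant is only there to make the counts comparable):

from pyspark import SparkContext

sc = SparkContext(appName="small-bz2-logs")

# In practice this list holds the 500-2000 bzipped log paths on the NFS mount
file_list = ["/mnt/nfs/logs/a.log.bz2", "/mnt/nfs/logs/b.log.bz2"]

# Variant 1: textFile accepts a comma-separated list of paths and
# decompresses .bz2 transparently; this is the slow case described above.
n_lines = sc.textFile(",".join(file_list)).count()

# Variant 2: wholeTextFiles returns (filename, content) pairs and handles
# many small files better, but (as noted above) it does not decompress our
# bzipped input, so this only works on uncompressed copies of the logs.
n_lines_whole = (sc.wholeTextFiles("/mnt/nfs/logs_uncompressed/")
                   .flatMap(lambda kv: kv[1].splitlines())
                   .count())

print(n_lines, n_lines_whole)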

*The initialization of a SparkContext takes about 4 seconds. Sending a
Spark job to a cluster takes even longer. Is there a way to shorten this
initialization phase?

*Is my use case actually an appropriate use case for Spark?

Many thanks for your help and comments!



