Hello,

I am running Spark 1.4.0 on Mesos 0.22.1, and usually I run my jobs in
coarse-grained mode.

I have written some single-threaded standalone Scala applications for a problem that I am working on, and I am unable to get a Spark solution that comes close to their performance. My hope was to sacrifice some performance to get an easily scalable solution, but I'm finding that the single-threaded implementations consistently outperform Spark even with a couple dozen cores, and I'm having trouble getting Spark to scale linearly.

All files are binary files with fixed-width records, ranging from about 40 bytes to 200 bytes per record depending on the type. The files are already partitioned by three keys, with one file for each combination; basically the layout is /customer/day/partition_number. The ultimate goal is to read time series events, join in some smaller tables when processing those events, and write the result to Parquet. For this discussion, I'm focusing on just a simple problem: reading and aggregating the events.

I started with a simple experiment to walk over all the events and sum the value of an integer field. I implemented two standalone solutions and a Spark solution; a rough sketch of each follows the list:

1) For each file, use a BufferedInputStream to iterate over each fixed-width
   row, copy the row to an Array[Byte], and then parse the one field out of
   that array. This can process events at about 30 million/second.

2) Memory-map each file to a java.nio.MappedByteBuffer. Calculate the sum by
   directly selecting the integer field while iterating over the rows. This
   solution can process about 100-300 million events/second.

3) Use SparkContext.binaryRecords, map over the RDD[Array[Byte]] to parse or
   select the field, and then call sum on that.
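
To make the comparison concrete, here is roughly what solution #1 looks like. The 48-byte record width and field offset of 16 are placeholder values for illustration, not my actual format:

    import java.io.{BufferedInputStream, DataInputStream, EOFException, FileInputStream}

    object StreamSum {
      val RecordWidth = 48  // placeholder: real records are 40-200 bytes
      val FieldOffset = 16  // placeholder offset of the int field

      def sumFile(path: String): Long = {
        val in = new DataInputStream(
          new BufferedInputStream(new FileInputStream(path)))
        val row = new Array[Byte](RecordWidth)
        var sum = 0L
        try {
          while (true) {
            in.readFully(row)  // copy one fixed-width row into the array
            // parse the big-endian int field out of the byte array
            sum += ((row(FieldOffset)     & 0xff) << 24 |
                    (row(FieldOffset + 1) & 0xff) << 16 |
                    (row(FieldOffset + 2) & 0xff) << 8  |
                    (row(FieldOffset + 3) & 0xff)).toLong
          }
        } catch { case _: EOFException => () }  // end of file
        finally in.close()
        sum
      }
    }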
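
Solution #2 is along these lines, with the same placeholder width and offset:

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    object MmapSum {
      val RecordWidth = 48  // same placeholders as above
      val FieldOffset = 16

      def sumFile(path: String): Long = {
        val channel = new RandomAccessFile(path, "r").getChannel
        try {
          // map the whole file; assumes each file fits in one <2GB mapping
          val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size)
          val numRecords = (channel.size / RecordWidth).toInt
          var sum = 0L
          var i = 0
          while (i < numRecords) {
            // read the field in place -- no per-row copy
            sum += buf.getInt(i * RecordWidth + FieldOffset)
            i += 1
          }
          sum
        } finally channel.close()
      }
    }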
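
And the Spark version is essentially the following, again with placeholder width and offset, and taking the input path from the command line:

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkSum {
      val RecordWidth = 48  // same placeholders as above
      val FieldOffset = 16

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SumField"))
        // binaryRecords yields one Array[Byte] per fixed-width record
        val records = sc.binaryRecords(args(0), RecordWidth)
        val total = records.map { row =>
          ((row(FieldOffset)     & 0xff) << 24 |
           (row(FieldOffset + 1) & 0xff) << 16 |
           (row(FieldOffset + 2) & 0xff) << 8  |
           (row(FieldOffset + 3) & 0xff)).toDouble
        }.sum()
        println(s"total = $total")
        sc.stop()
      }
    }

Note that variants #1 and #3 do the same byte arithmetic on an Array[Byte]; the only real difference is how the rows reach that code.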

Although performance is understandably much better when I use a memory-mapped ByteBuffer, I would expect my Spark solution to get the same per-core throughput as solution #1 above, where the record type is Array[Byte] and I'm using the same approach to pull out the integer field from that byte array.

However, the Spark solution achieves only 1-2 million events/second on 1 core, 4 million events/second on 2 nodes with 4 cores each, and 8 million events/second on 6 nodes with 4 cores each. So not only is the performance a fraction of my standalone application's, but it doesn't even scale linearly to 6 nodes.

- Philip
