Xinyu Liu created SAMZA-1043:
--------------------------------
Summary: Samza performance improvements
Key: SAMZA-1043
URL: https://issues.apache.org/jira/browse/SAMZA-1043
Project: Samza
Issue Type: Improvement
Reporter: Xinyu Liu
Assignee: Xinyu Liu
Fix For: 0.12.0
In recent experiments with a Samza batch job (consuming HDFS data on Hadoop),
the results were subpar compared to MapReduce and Spark. Looking closely at the
metrics, we found two basic problems:
1) Not enough data to process. The unprocessed message queue length was
frequently zero.
2) Not processing fast enough. Samza's throughput was about the same for
medium-sized records (100B) and small records (10B), while Spark scaled very
well with small records (over 1M/s).
The first problem was solved by increasing the buffer size. This ticket
addresses the second problem with three major improvements:
- Option to turn off timer metrics calculation: one of the main time sinks in
Samza processing turns out to be just maintaining the timer metrics. While they
are useful for debugging, they become a bottleneck when running a stable job at
high throughput. My test job, which consumes 8M mock messages, took 30 secs
with timer metrics on. After turning them off, it took only 14 secs.
- Java coding improvements: The AsyncRunLoop code can be further optimized for
efficiency. Some of the thread-safe data structures I was using
(Collections.synchronizedSet) are not optimal for performance. I switched to
CopyOnWriteArraySet, which performs far better here because the access pattern
is read-heavy and the set is small.
- In-order processing path improvements: AsyncRunLoop handles callbacks the
same way regardless of whether processing is in-order or out-of-order (max
concurrency > 1), which incurs considerable cost. Simplifying the logic for
in-order handling improves performance.
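
To illustrate the first two items, here is a minimal sketch, not the actual
AsyncRunLoop code. The class RunLoopSketch, the timerEnabled flag, and every
method name below are hypothetical. It assumes the timer cost comes from the
per-message System.nanoTime() calls plus the metric update, and that the set
of tasks is small and read far more often than it is written.

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; none of these names come from the Samza code base.
final class RunLoopSketch {
    // Hypothetical switch standing in for the "turn off timer metrics" option.
    private final boolean timerEnabled;
    private final AtomicLong timerSamples = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    // Small, read-mostly set: reads and iteration take no lock, while each
    // write copies the backing array. Collections.synchronizedSet would
    // instead lock on every single access.
    private final Set<String> tasks = new CopyOnWriteArraySet<>();
    private long processed;

    RunLoopSketch(boolean timerEnabled) {
        this.timerEnabled = timerEnabled;
    }

    void register(String taskName) {
        tasks.add(taskName);
    }

    // Runs one unit of work; clock reads and metric updates are skipped
    // entirely when the timer is disabled.
    void process(Runnable work) {
        if (timerEnabled) {
            long start = System.nanoTime();
            work.run();
            totalNanos.addAndGet(System.nanoTime() - start);
            timerSamples.incrementAndGet();
        } else {
            work.run();
        }
        processed++;
    }

    long timerSamples() { return timerSamples.get(); }
    long processed() { return processed; }
    int taskCount() { return tasks.size(); }

    public static void main(String[] args) {
        // Rough comparison on a no-op workload; absolute numbers depend on
        // the machine and JIT warm-up, so none are claimed here.
        for (boolean enabled : new boolean[] {true, false}) {
            RunLoopSketch loop = new RunLoopSketch(enabled);
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                loop.process(() -> { });
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("timerEnabled=" + enabled + " elapsedMs=" + elapsedMs);
        }
    }
}
```

The trade-off with CopyOnWriteArraySet is that every add or remove copies the
whole backing array, so it only pays off while the set stays small and writes
stay rare, which is the situation described above.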
After all three improvements, my test job with mock input (8M messages) is
processed within 8 secs, i.e. 1M/s on one CPU core.
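
For reference, the timings quoted in this ticket convert to throughput as
sketched below. This assumes the 30 sec and 14 sec runs used the same 8M
message mock input as the final 8 sec run. ThroughputSketch and
messagesPerSecond are made-up names for illustration.

```java
// Illustrative arithmetic only; the class and method names are hypothetical.
final class ThroughputSketch {
    // Whole messages per second (integer division truncates the fraction).
    static long messagesPerSecond(long messages, long seconds) {
        return messages / seconds;
    }

    public static void main(String[] args) {
        long messages = 8_000_000L;
        String[] labels = {"timer metrics on", "timer metrics off", "all three improvements"};
        long[] seconds = {30L, 14L, 8L};
        for (int i = 0; i < labels.length; i++) {
            System.out.println(labels[i] + ": "
                + messagesPerSecond(messages, seconds[i]) + " msg/s");
        }
    }
}
```

Under that assumption, going from 30 secs to 8 secs is a 3.75x speedup overall.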