[ 
https://issues.apache.org/jira/browse/SAMZA-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyu Liu updated SAMZA-1043:
-----------------------------
    Description: 
In recent experiments with Samza batch jobs (consuming HDFS data on Hadoop), 
the results were subpar compared to MapReduce and Spark. Looking closely at 
the metrics, we found two basic problems:
1) Not enough data to process. We spotted this because the unprocessed message 
queue length was frequently zero.
2) Processing is not fast enough. Samza performed about the same on both 
medium-size records (100B) and small records (10B), while Spark scales very 
well on small records (over 1M/s).

The first problem is solved by increasing the buffer size. This ticket 
addresses the second problem with three major improvements:

- Option to turn off timer metrics calculation: one of the main time sinks in 
Samza processing turns out to be just maintaining the timer metrics. While 
they are useful for debugging, they become a bottleneck when running a stable 
job at high throughput. My test job, which consumes 8M mock messages, took 30 
secs with timer metrics on. After turning them off, it took only 14 secs.

- Java coding improvements: The AsyncRunLoop code can be further optimized for 
efficiency. Some of the thread-safe data structures I was using 
(Collections.synchronizedSet) are not optimal for performance. I switched to 
CopyOnWriteArraySet, which performs far better here because reads dominate 
and the set is small.

- In-order processing path improvements: AsyncRunLoop handles callbacks the 
same way regardless of whether processing is in-order or out-of-order (max 
concurrency > 1), which incurs considerable cost. Simplifying the logic for 
in-order handling improves performance.
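
To illustrate the data-structure point above, here is a minimal sketch 
comparing the two set types on a small, read-mostly set. The class and method 
names (SetChoiceSketch, fill, countHits) are made up for this example and are 
not taken from AsyncRunLoop. CopyOnWriteArraySet reads scan a snapshot array 
without locking, while every write copies the array, so it only pays off when 
the set is small and rarely modified. The timing loop is a rough 
single-threaded illustration, not a proper benchmark.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

final class SetChoiceSketch {
    // Adds the values 0..size-1, mimicking a small set that rarely changes.
    static Set<Integer> fill(Set<Integer> set, int size) {
        for (int i = 0; i < size; i++) {
            set.add(i);
        }
        return set;
    }

    // Read-heavy workload: probes values 0..15 in a cycle and counts the hits.
    static int countHits(Set<Integer> set, int probes) {
        int hits = 0;
        for (int i = 0; i < probes; i++) {
            if (set.contains(i % 16)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Every call on the synchronized wrapper takes the wrapper's lock.
        Set<Integer> sync = fill(Collections.synchronizedSet(new HashSet<Integer>()), 8);
        // Reads here are lock-free; only add/remove copy the backing array.
        Set<Integer> cow = fill(new CopyOnWriteArraySet<Integer>(), 8);

        int probes = 10_000_000;
        long t0 = System.nanoTime();
        int syncHits = countHits(sync, probes);
        long t1 = System.nanoTime();
        int cowHits = countHits(cow, probes);
        long t2 = System.nanoTime();

        System.out.println("synchronizedSet:     " + syncHits + " hits, " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("CopyOnWriteArraySet: " + cowHits + " hits, " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Note that CopyOnWriteArraySet.contains is a linear scan, so the advantage 
depends on the set staying small, as it does in the case described above.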

After all three improvements, my test job with mock input (8M messages) is 
processed within 8 secs, which is about 1M/s on one CPU core. 
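
For readers unfamiliar with the first improvement, the idea of making timer 
metrics optional can be sketched as below. This is only an illustration under 
stated assumptions: TimerToggleSketch, SimpleTimer and the timerEnabled flag 
are hypothetical names, not the actual Samza classes or configuration. The 
point is that when the flag is off, the per-message clock reads and timer 
updates are skipped entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TimerToggleSketch {
    // Hypothetical stand-in for a timer metric: total elapsed time plus sample count.
    static final class SimpleTimer {
        final AtomicLong totalNanos = new AtomicLong();
        final AtomicLong samples = new AtomicLong();

        void update(long nanos) {
            totalNanos.addAndGet(nanos);
            samples.incrementAndGet();
        }
    }

    private final boolean timerEnabled; // hypothetical flag, assumed to come from job config
    final SimpleTimer processTimer = new SimpleTimer();

    TimerToggleSketch(boolean timerEnabled) {
        this.timerEnabled = timerEnabled;
    }

    // Runs the task; reads the clock and updates the timer only when enabled.
    void process(Runnable task) {
        if (!timerEnabled) {
            task.run();
            return;
        }
        long start = System.nanoTime();
        task.run();
        processTimer.update(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        TimerToggleSketch loop = new TimerToggleSketch(false);
        long[] processed = {0};
        for (int i = 0; i < 1_000_000; i++) {
            loop.process(() -> processed[0]++);
        }
        System.out.println("processed=" + processed[0]
            + " timerSamples=" + loop.processTimer.samples.get());
    }
}
```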

For the performance benchmark jobs running on Hadoop, we also see a 4x 
improvement with all the fixes above. Please take a look at the attached 
spreadsheet (see the numbers for fix (timer metrics turned off) and fix2 (all 
three together)).


> Samza performance improvements
> ------------------------------
>
>                 Key: SAMZA-1043
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1043
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Xinyu Liu
>            Assignee: Xinyu Liu
>             Fix For: 0.12.0
>
>         Attachments: HDFS-performance.xlsx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
