Re: BK metrics

2018-03-21 Thread Vijay Srinivasaraghavan
 
>> The disks were under 60% utilization (not saturated).
>
> 60% of bandwidth or IOPS? Only one of the two needs to be saturated.
> And which disk, journal or ledgers?

It is the disk busy percentage. Both journal and ledger disks were around
60% (journal was more consistent).

>> Are there any benchmark results of BookKeeper that can be shared?
>
> I don't have any to hand, but maybe someone else on the list does.

Which key BookKeeper metrics would you suggest monitoring? It would also be
nice if there were documentation that describes each metric at a high level:
what it means (how to understand and interpret it) and what to expect from it
(numbers for best- and worst-case scenarios).
Regards,
Vijay

Re: BK metrics

2018-03-21 Thread Ivan Kelly
> @Ivan, for some reason I did not receive your reply, but I found it in the
> email archives.

Are you subscribed to the list? I did see one mail from you show up in
moderation.

> At an 80K requests/sec throttle with a 1 KB record size, I am getting the
> throughput below. The 99th percentiles of `bookkeeper_server_ADD_ENTRY_REQUEST`
> and `bookkeeper_server_ADD_ENTRY` are around 350 ms. I start to see lag when I
> increase the ingestion rate limit beyond 90K/sec.

So this suggests to me that the metrics are reporting correctly.

> The disks were under 60% utilization (not saturated).

60% of bandwidth or IOPS? Only one of the two needs to be saturated.
And which disk, journal or ledgers?
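
For concreteness, a back-of-the-envelope sketch of the two limits (standalone
Java; the workload figures are the ones quoted in this thread, while the device
limits and the adds-per-fsync grouping factors are invented placeholders, not
measurements):

```
// A disk can bottleneck on bandwidth (MB/s) or on IOPS independently.
// Workload numbers are from this thread; device limits are assumptions.
public class DiskSaturationSketch {
    public static void main(String[] args) {
        long requestsPerSec = 80_000;  // throttled ingest rate
        long recordBytes = 1_000;      // record size

        double bandwidthMBps = requestsPerSec * recordBytes / 1e6;  // ~80 MB/s
        double deviceMBps = 400;       // assumed sequential write limit
        double deviceIops = 20_000;    // assumed sustained write IOPS

        System.out.printf("Bandwidth: %.0f MB/s (%.0f%% of assumed limit)%n",
                bandwidthMBps, 100 * bandwidthMBps / deviceMBps);

        // If the journal groups N adds per fsync, IOPS demand shrinks by N.
        for (int addsPerFsync : new int[] {1, 8, 64}) {
            double iops = (double) requestsPerSec / addsPerFsync;
            System.out.printf("adds/fsync=%d -> %.0f IOPS (%.0f%% of assumed limit)%n",
                    addsPerFsync, iops, 100 * iops / deviceIops);
        }
    }
}
```

With these assumed limits, the bandwidth side sits well under the device
ceiling, so an un-grouped fsync-per-add pattern would be the more likely way
to saturate the journal disk first.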

> Are there any benchmark results of BookKeeper that can be shared?

I don't have any to hand, but maybe someone else on the list does.

Regards,
Ivan


BK metrics

2018-03-21 Thread Vijay Srinivasaraghavan

>> 2) If it's in milliseconds, are these numbers in the expected range (see
>> the attached image)? To me, 2.5 seconds (2,500 ms) of latency for an add
>> entry request is very high.
>
> 2.5 seconds is very high, but your write rate is also high: 100,000 *
> 1 KB is 100 MB/s. An SSD should be able to absorb that on the journal side,
> but it depends on the hardware.
>
> Have you tried reducing the write rate to see how the latency changes?
> What is the client seeing for latency? I assume the client and all
> servers have 10GigE NICs?
>
> Your images didn't attach correctly. Maybe they're too big to post
> directly to the list. There is a size limit, but I don't know what it
> is.
>
> -Ivan
@Ivan, for some reason I did not receive your reply, but I found it in the
email archives. I have copied your response above for context.

At an 80K requests/sec throttle with a 1 KB record size, I am getting the
throughput below. The 99th percentiles of `bookkeeper_server_ADD_ENTRY_REQUEST`
and `bookkeeper_server_ADD_ENTRY` are around 350 ms. I start to see lag when I
increase the ingestion rate limit beyond 90K/sec.
The disks were under 60% utilization (not saturated). All client and server
machines have 10G NICs.
Throughput (records/sec): 79505
Throughput (bytes/sec): 75.8 MB/s
Latency (ms): average 118, 50th 106, 75th 139, 90th 193, 99th 395, 999th 658
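
A quick consistency check on those figures (a standalone sketch that only
re-derives the numbers quoted above; it suggests the benchmark's "MB/s" is
actually the binary unit, MiB/s):

```
// Re-derive the reported throughput from the benchmark's own numbers.
public class ThroughputCheck {
    public static void main(String[] args) {
        long recordsPerSec = 79_505;   // "Throughput (records/sec)" above
        long bytesPerRecord = 1_000;   // record size used in this test

        double bytesPerSec = recordsPerSec * (double) bytesPerRecord;
        // Decimal megabytes: ~79.5 MB/s
        System.out.printf("%.1f MB/s (decimal)%n", bytesPerSec / 1_000_000.0);
        // Binary mebibytes: ~75.8 MiB/s -- matches the reported 75.8
        System.out.printf("%.1f MiB/s (binary)%n", bytesPerSec / (1024.0 * 1024.0));
    }
}
```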
Are there any benchmark results of BookKeeper that can be shared?
Regards,
Vijay

Re: BK metrics

2018-03-20 Thread Ivan Kelly
> 2) If it's in milliseconds, are these numbers in the expected range (see
> the attached image)? To me, 2.5 seconds (2,500 ms) of latency for an add
> entry request is very high.

2.5 seconds is very high, but your write rate is also high: 100,000 *
1 KB is 100 MB/s. An SSD should be able to absorb that on the journal side,
but it depends on the hardware.

Have you tried reducing the write rate to see how the latency changes?
What is the client seeing for latency? I assume the client and all
servers have 10GigE NICs?
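
One way to see exactly what the client sees is to time `addEntry` in the load
generator itself. A minimal sketch against the BookKeeper ledger API (the
ZooKeeper address, ensemble/quorum sizes, and password are placeholders; a
synchronous loop like this measures unloaded round-trip latency, not latency
under the benchmark's pipelined load):

```
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class ClientLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address and ledger settings -- adjust to your setup.
        BookKeeper bk = new BookKeeper("zk1:2181");
        LedgerHandle lh = bk.createLedger(3, 2, DigestType.CRC32,
                "passwd".getBytes());

        byte[] payload = new byte[1000];  // matches the 1,000-byte test record
        for (int i = 0; i < 1_000; i++) {
            long start = System.nanoTime();
            lh.addEntry(payload);         // synchronous add, one entry at a time
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("addEntry latency (us): " + micros);
        }

        lh.close();
        bk.close();
    }
}
```

Comparing these client-side numbers with the server's `ADD_ENTRY_REQUEST`
percentiles shows how much of the latency is network and client-side queueing
rather than bookie processing.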

Your images didn't attach correctly. Maybe they're too big to post
directly to the list. There is a size limit, but I don't know what it
is.

-Ivan


BK metrics

2018-03-19 Thread Vijay Srinivasaraghavan
Hello,
I am running a load test scenario where we have 3 bookies, dedicated SSDs for
the journal and ledger disks, and a 5 GB JVM heap with G1GC enabled.

```
jvm_opts: -Xmx5g -XX:+UseG1GC -XX:MaxGCPauseMillis=100
  -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions
  -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32
  -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC
  -XX:-ResizePLAB -XX:+PrintFlagsFinal -XX:+PrintGC -XX:+PrintGCCause
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
```

I am testing with a 1,000-byte record size, ingesting around 50M records at an
ingestion rate limit of 100K/sec.

I wanted to understand how the metrics of type `stats` are reported:

bookkeeper_server_ADD_ENTRY_REQUEST
bookkeeper_server_ADD_ENTRY
JOURNAL_ADD_ENTRY
JOURNAL_SYNC
JOURNAL_QUEUE_LATENCY
JOURNAL_FLUSH_LATENCY
JOURNAL_PROCESS_TIME_LATENCY
My understanding is that the above metrics are recorded in microseconds (from
the BK code) and that the reporters (we use statsD with the `codahale` provider
to collect BK metrics and sink them into `InfluxDB`) convert `rates` to seconds
and `durations` to milliseconds.
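
The unit handling can be seen in miniature with the plain Dropwizard/Codahale
API: a `Timer` stores durations internally in nanoseconds regardless of the
unit you feed it, and the reporter's configured duration unit determines what
gets emitted. A standalone sketch (the console reporter stands in for a statsD
reporter here, and the metric name is just borrowed from the list above):

```
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class ReporterUnitsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer addEntry = registry.timer("bookkeeper_server_ADD_ENTRY");

        // Record one 350 ms sample; the input unit is microseconds, but
        // Dropwizard normalizes to nanoseconds internally.
        addEntry.update(350_000, TimeUnit.MICROSECONDS);

        // The reporter decides the output units: rates in events/sec,
        // durations in milliseconds. This is where the graphed unit comes from.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.report();  // percentiles print as ~350.0 (milliseconds)
    }
}
```

Whether your statsD reporter is built with the same conversions determines the
units you see on the graphs.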

1) I wanted to confirm whether the final graph values I am seeing in the UI
(attached) are represented in milliseconds or some other unit.
2) If it's in milliseconds, are these numbers in the expected range (see the
attached image)? To me, 2.5 seconds (2,500 ms) of latency for an add entry
request is very high.
Any help understanding the metrics is much appreciated.

Regards,
Vijay