Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=10969
(In reply to comment #67)
> The brw_stats looks enough for the time being. But please keep in mind that
> users do NOT have access to the servers and the brw_stats info stored on the
> OSTs will NOT be available to the apps perf tool directly.

Yes, I'm just trying to get a handle on whether we're collecting the right data in the first place. Collecting/presenting it is a different challenge.

> Anomalies were meant over all the clients and per client as well. It was
> suggested as an idea to keep track of a slow client or a slow server for the
> duration of an application. Also it can be a very powerful tool when combined
> with the timestamps (see below please).

With the current stats at any given moment we could compare e.g. the average ost_setattr execution time and note that OST5 is 10% slower than the average OST, or that client7 has the highest average write size on OST2. I think potentially one of the most difficult parts of this tool is deciding how to prune down the data we present into a comprehensible amount.

> Timestamped info means the ability to playback the I/O for the duration of an
> application. It does not need to be very fine grained (i.e. aggregate
> timestamped summary info for every X msecs/secs per each client/server should
> be sufficient).

e.g. something like:

  11:02 client7 7MB w, 10MB r, 3004 RPCs, waited for 5 locks, 10 locks revoked

The more concrete we can make our examples, the better.

> Yes, we meant RPC request queues (e.g. time spent on queue, queue depth).

Ok. We already collect this information per server.

  req_waittime 117364 samples [usec] 34 23445 21251101 7973894281
  req_qdepth 117364 samples [reqs] 0 8 29906 30464

> Probably not ALL RPC related info. I am assuming "ALL" would be overwhelming
> to analyze and digest. Perhaps, we need to list the most striking ones. What
> do you suggest Nathan?

The slow outliers would probably be the most interesting. e.g.
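As a side note on the req_waittime and req_qdepth lines above, here is a minimal Python sketch of how a perf tool might turn such a line into an average and a spread. It assumes the eight fields are: name, sample count, the literal word "samples", unit in brackets, min, max, sum, and sum of squares. That field order is an assumption on my part, and the names StatLine, parse_stat_line, mean and stddev are hypothetical helpers, not anything that exists in Lustre.

```python
import math
from typing import NamedTuple


class StatLine(NamedTuple):
    """One parsed stats line (hypothetical container, not a Lustre type)."""
    name: str
    count: int
    unit: str
    minimum: int
    maximum: int
    total: int
    sum_squares: int


def parse_stat_line(line: str) -> StatLine:
    """Parse 'name count samples [unit] min max sum sumsq'.

    The field order is assumed, not taken from the Lustre source.
    """
    fields = line.split()
    if len(fields) != 8 or fields[2] != "samples":
        raise ValueError(f"unexpected stats line: {line!r}")
    name, count, _, unit, lo, hi, total, sumsq = fields
    return StatLine(name, int(count), unit.strip("[]"),
                    int(lo), int(hi), int(total), int(sumsq))


def mean(stat: StatLine) -> float:
    """Average value per sample."""
    return stat.total / stat.count


def stddev(stat: StatLine) -> float:
    """Population standard deviation from sum and sum of squares."""
    variance = stat.sum_squares / stat.count - mean(stat) ** 2
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    waittime = parse_stat_line(
        "req_waittime 117364 samples [usec] 34 23445 21251101 7973894281")
    print(f"{waittime.name}: mean {mean(waittime):.1f} {waittime.unit}, "
          f"stddev {stddev(waittime):.1f} {waittime.unit}, "
          f"max {waittime.maximum} {waittime.unit}")
```

Under that assumed layout, the req_waittime line works out to a mean wait of roughly 181 usec against a maximum of 23445 usec, which is the kind of gap that makes the slow outliers worth reporting on their own.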
server info:
- at 11:02 req 1002 type 42 from client7 took 102s to process
- at that time, the q depth was 5, the avg waittime was 10s, and the average req of that type took 6s

client info:
- req 1002 from process 7 "ior" opc=fsync

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
