Re:Re: Consult some problems of Ceph when reading source code

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, 蔡毅 wrote:
 Dear Dr. Sage:
 
 Thank you for your detailed reply! These answers help me a lot. I also 
 have some questions about Question (1).
 
 In your reply, requests are enqueued into the ShardedWQ according to 
 their PG. If I have 3 requests (that is, (pg1,r1), (pg2,r2), (pg3,r3)) 
 and I put them into the ShardedWQ, is the processing also serialized?

Lots of threads are enqueuing things into the ShardedWQ.  A deterministic 
function of the pg determines which shard the request lands in.

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8247
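Very roughly, and with made-up types (this is not the actual OSD.cc code), 
the shard selection can be pictured as a pure function of the PG id, so 
every request for a given PG always lands in the same shard's queue:

// Minimal sketch with hypothetical types (not the real OSD code): the
// shard index is a deterministic function of the PG id, so all requests
// for the same PG are funneled into the same shard.
#include <cstdint>
#include <functional>

struct pg_id_t {
  uint64_t pool;   // pool the PG belongs to
  uint32_t seed;   // placement seed within the pool
};

// Pick a shard for this PG; num_shards is the configured shard count.
static uint32_t pick_shard(const pg_id_t& pg, uint32_t num_shards) {
  uint64_t h = std::hash<uint64_t>{}(pg.pool) ^
               (std::hash<uint32_t>{}(pg.seed) * 0x9e3779b97f4a7c15ULL);
  return static_cast<uint32_t>(h % num_shards);
}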

 When I want to dequeue an item from the ShardedWQ, there is a 
 work_queues member (a vector of work queues) in the ThreadPool class 
 (WorkQueue.cc), and the work queue to service is chosen from 
 work_queues. So are there many work queues involved in processing a 
 request, or is there no association with the ShardedWQ?

https://github.com/ceph/ceph/blob/master/src/common/WorkQueue.cc#L350

Any given thread services a single shard.  There can be more than 
one thread per shard.  There's a bunch of code in OSD.cc that ensures 
that the requests for any given PG are processed in order, serially, 
so if two threads pull off requests for the same PG one will block so 
that they still complete in order.
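
As a rough illustration of that ordering guarantee (again with made-up 
types, not the real OSD.cc logic): each shard can keep a FIFO per PG plus 
the set of PGs currently being processed, so a second thread that finds 
work for a busy PG skips it until the first thread is done.

// Illustrative sketch (hypothetical types): two threads never work on
// the same PG at once, so per-PG order is preserved.
#include <deque>
#include <map>
#include <mutex>
#include <set>

struct Request { int op = 0; };

struct Shard {
  std::mutex lock;
  std::map<int, std::deque<Request>> pg_queues;  // pg id -> pending items
  std::set<int> pgs_in_progress;                 // PGs a worker is handling

  // Take the next item from a PG nobody else is working on; returns false
  // if every non-empty PG is currently busy.
  bool try_dequeue(int& pg, Request& out) {
    std::lock_guard<std::mutex> g(lock);
    for (auto& [id, q] : pg_queues) {
      if (q.empty() || pgs_in_progress.count(id))
        continue;                   // keep per-PG ordering: skip busy PGs
      pg = id;
      out = q.front();
      q.pop_front();
      pgs_in_progress.insert(id);   // other threads now skip this PG
      return true;
    }
    return false;
  }

  // Called by the worker when it has finished processing the item.
  void done(int pg) {
    std::lock_guard<std::mutex> g(lock);
    pgs_in_progress.erase(pg);      // next item for this PG may be taken
  }
};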

 When I get an item from the ShardedWQ, I will convert it into a 
 transaction and then read or write. Is this done one by one (another 
 transaction is handled only after this transaction is over)? If it is, 
 can we guarantee performance? If it isn't, are the transactions' 
 actions parallel?

The write operations are analyzed, prepared, and then started (queued for 
disk and replicated over the network).  Completion is asynchronous (since 
it can take a while).

The read operations are currently done synchronously (we block while we 
read the data from the local copy on disk), although this is likely to 
change soon to be either synchronous or async (depending on the backend, 
hardware, etc.).
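
To make the two paths concrete, a simplified sketch (hypothetical names, 
not the actual PG/ObjectStore interface) might look like this: the write 
is queued and acknowledged later through a completion callback, while the 
read blocks the calling worker thread.

// Simplified sketch with hypothetical names (not the actual Ceph API):
// writes complete asynchronously via a callback, reads block the caller.
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Completion = std::function<void(int result)>;

struct LocalStore {
  // Queue the transaction for disk (and, in a real OSD, replicate the op
  // over the network); 'on_commit' fires later from another thread.
  void queue_write(const std::string& oid, std::vector<char> data,
                   Completion on_commit) {
    (void)oid; (void)data;
    std::thread([on_commit]() { on_commit(0); }).detach();  // pretend success
  }

  // Synchronous read: block until the local copy has been read.
  int read(const std::string& oid, std::vector<char>& out) {
    out.assign(oid.begin(), oid.end());  // placeholder payload
    return static_cast<int>(out.size());
  }
};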

HTH!
sage


 Thanks a lot!
 
 
 
 
 
 At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote:
 Hi!
 
 On Thu, 6 Aug 2015, 蔡毅 wrote:
  Dear developers,
  
  My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an 
  Jiaotong University in China. From Ceph's homepage, I know Sage is the 
  author of Ceph, and I got this email address from your GitHub and Ceph's 
  official website. Because Ceph is an excellent distributed file system, 
  I have recently been reading the source code of Ceph (the version is 
  Hammer) to understand the IO good path and the performance of Ceph. 
  However, I face some problems for which I could not find a solution on 
  the Internet or solve by myself or with my partners. So I was wondering 
  if you could help us solve some problems. The problems are as follows:
  
  1)  In Ceph, there is the concept of a transaction. When the OSD 
  receives a write request, it is encapsulated in a transaction. But when 
  the OSD receives many requests, is there a transaction queue to receive 
  the messages? If there is a queue, are these transactions submitted to 
  the next operation serially or in parallel? If it is serial, could the 
  transaction operations influence the performance?
 
 The requests are distributed across placement groups and into a shared 
 work queue, implemented by ShardedWQ in common/WorkQueue.h.  This 
 serializes processing for a given PG, but this generally makes little 
 difference as there are typically 100 or more PGs per OSD.
 
  2)  From some documents about Ceph, if the OSD receives a read request, 
  the OSD can only read data from the primary and then send it back to the 
  client. Is this description right?
 
 Yes.  This is usually the right thing to do or else a given object will 
 end up consuming cache (memory) on more than one OSD and the overall cache 
 efficiency of the cluster will drop by your replication factor.  It's only 
 a win to distribute reads when you have a very hot object, or when you 
 want to spend OSD resources to reduce latency (e.g., by sending reads to 
 all replicas and taking the fastest reply).
 
  Is there any way to read the data from a replica 
  OSD? Do we have to request the data from the primary OSD when dealing 
  with a read request? If not, and we can read from a replica OSD, can we 
  guarantee consistency?
 
 There is a client-side flag to read from a random or the closest 
 replica, but there are a few bugs that affect consistency when recovery is 
 underway that are being fixed up now.  It is likely that this will work 
 correctly in Infernalis, the next stable release.
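
For reference, those client-side flags live in librados. A hedged example 
using the librados C API (verify the flag names against the librados.h 
shipped with your release) that asks for a balanced read looks roughly 
like this; BALANCE_READS picks a random replica, LOCALIZE_READS prefers 
the "closest" one:

/* Hedged example using the librados C API (check your librados.h). */
#include <rados/librados.h>

int read_possibly_from_replica(rados_ioctx_t io, const char *oid,
                               char *buf, size_t len)
{
    rados_read_op_t op = rados_create_read_op();
    size_t bytes_read = 0;
    int rval = 0;

    rados_read_op_read(op, 0, len, buf, &bytes_read, &rval);
    int r = rados_read_op_operate(op, io, oid,
                                  LIBRADOS_OPERATION_BALANCE_READS);
    rados_release_read_op(op);
    return (r < 0) ? r : (int)bytes_read;
}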
 
  3)  When the OSD receives the message, the message's attribute may be 
  the normal dispatch or the fast dispatch. What is the difference between 
  the normal dispatch and the fast dispatch? If the attribute is the 
  normal dispatch, it enters the dispatch queue. Is there a single 
  dispatch queue or multiple dispatch queues to deal with all the 
  messages?

Re:Re: Consult some problems of Ceph when reading source code

2015-08-06 Thread 蔡毅
Dear Dr. Sage:
Thank you for your detailed reply! These answers help me a lot. I also have 
some questions about Question (1).
In your reply, requests are enqueued into the ShardedWQ according to their 
PG. If I have 3 requests (that is, (pg1,r1), (pg2,r2), (pg3,r3)) and I put 
them into the ShardedWQ, is the processing also serialized?
When I want to dequeue an item from the ShardedWQ, there is a work_queues 
member (a vector of work queues) in the ThreadPool class (WorkQueue.cc), and 
the work queue to service is chosen from work_queues. So are there many work 
queues involved in processing a request, or is there no association with the 
ShardedWQ?
When I get an item from the ShardedWQ, I will convert it into a transaction 
and then read or write. Is this done one by one (another transaction is 
handled only after this transaction is over)? If it is, can we guarantee 
performance? If it isn't, are the transactions' actions parallel?
Thanks a lot!





At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote:
Hi!

On Thu, 6 Aug 2015, 蔡毅 wrote:
 Dear developers,
 
 My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an 
 Jiaotong University in China. From Ceph's homepage, I know Sage is the 
 author of Ceph, and I got this email address from your GitHub and Ceph's 
 official website. Because Ceph is an excellent distributed file system, 
 I have recently been reading the source code of Ceph (the version is 
 Hammer) to understand the IO good path and the performance of Ceph. 
 However, I face some problems for which I could not find a solution on 
 the Internet or solve by myself or with my partners. So I was wondering 
 if you could help us solve some problems. The problems are as follows:
 
 1)  In Ceph, there is the concept of a transaction. When the OSD 
 receives a write request, it is encapsulated in a transaction. But when 
 the OSD receives many requests, is there a transaction queue to receive 
 the messages? If there is a queue, are these transactions submitted to 
 the next operation serially or in parallel? If it is serial, could the 
 transaction operations influence the performance?

The requests are distributed across placement groups and into a shared 
work queue, implemented by ShardedWQ in common/WorkQueue.h.  This 
serializes processing for a given PG, but this generally makes little 
difference as there are typically 100 or more PGs per OSD.

 2)  From some documents about Ceph, if the OSD receives a read request, 
 the OSD can only read data from the primary and then send it back to the 
 client. Is this description right?

Yes.  This is usually the right thing to do or else a given object will 
end up consuming cache (memory) on more than one OSD and the overall cache 
efficiency of the cluster will drop by your replication factor.  It's only 
a win to distribute reads when you have a very hot object, or when you 
want to spend OSD resources to reduce latency (e.g., by sending reads to 
all replicas and taking the fastest reply).

 Is there any way to read the data from a replica 
 OSD? Do we have to request the data from the primary OSD when dealing 
 with a read request? If not, and we can read from a replica OSD, can we 
 guarantee consistency?

There is a client-side flag to read from a random or the closest 
replica, but there are a few bugs that affect consistency when recovery is 
underway that are being fixed up now.  It is likely that this will work 
correctly in Infernalis, the next stable release.

 3)  When the OSD receives the message, the message's attribute may be 
 the normal dispatch or the fast dispatch. What is the difference between 
 the normal dispatch and the fast dispatch? If the attribute is the 
 normal dispatch, it enters the dispatch queue. Is there a single 
 dispatch queue or multiple dispatch queues to deal with all the messages?

There is a single thread that does the normal dispatch.  Fast dispatch 
processes the message synchronously from the thread that received the 
message, so it is faster, but it has to be careful not to block.
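
Schematically (an illustration only, not the actual Messenger/Dispatcher 
interface), the two paths can be pictured like this: normal dispatch hands 
the message to the single dispatch thread through a queue, while fast 
dispatch calls the handler directly from the thread that read the message 
off the wire, so that handler must never block.

// Illustrative sketch with hypothetical types (not the real Messenger code).
#include <condition_variable>
#include <mutex>
#include <queue>

struct Message {};

struct Dispatcher {
  virtual void handle(Message* m) = 0;        // may block; runs in dispatch thread
  virtual void handle_fast(Message* m) = 0;   // must not block; runs in reader thread
  virtual ~Dispatcher() = default;
};

struct Messenger {
  Dispatcher* d = nullptr;
  std::queue<Message*> q;
  std::mutex m;
  std::condition_variable cv;

  void deliver_normal(Message* msg) {          // enqueue for the dispatch thread
    { std::lock_guard<std::mutex> g(m); q.push(msg); }
    cv.notify_one();
  }

  void deliver_fast(Message* msg) {            // synchronous, same thread as the reader
    d->handle_fast(msg);
  }

  void dispatch_loop() {                       // the one normal-dispatch thread
    for (;;) {
      std::unique_lock<std::mutex> g(m);
      cv.wait(g, [this] { return !q.empty(); });
      Message* msg = q.front(); q.pop();
      g.unlock();
      d->handle(msg);
    }
  }
};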

 These are the problems I am facing. Thank you for your patience and 
 cooperation, and I look forward to hearing from you.

Hope that helps!
sage