guihecheng edited a comment on pull request #3254:
URL: https://github.com/apache/ozone/pull/3254#issuecomment-1085357251


   Thanks @sodonnel for taking the time to look at this. This PR is just one 
possible proposal for discussion, and some of the problems need more input.
   
   It seems that another possible approach has been proposed, and I've noted 
some doubts below.
   
   > I read through the doc and the change here, and I am not sure tracking 
space is the correct way to solve this. I know that EC files should be large, 
but they will not always be. Also a large cluster does not always need a lot of 
EC pipelines open. 
   
   Well, I agree that we don't always need a lot of pipelines open, especially 
when the IO load is low, but we need more of them under heavy load, and a 
larger cluster tends to serve heavier loads. Assuming EC 3:2, a cluster of 100 
nodes will often need more open pipelines than a cluster of 5 nodes in order to 
utilize more DNs to carry the IO load (I ran a small test on a cluster of 30 
DNs during the first discussion under the JIRA; more pipelines benefit 
performance until we saturate the client bandwidth).
   And we don't have to worry about too many pipelines: the original code has a 
minimum for open pipelines that bounds the pipeline count, and it is still 
there, only with a larger value for a larger cluster.
   So since we are bounded, I think we don't have to worry too much about 
small-file workloads, which rarely take the EC path in real production 
deployments.
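
   As a hedged sketch of the bounding described above (the function name, the 
minimum-scaling rule, and the per-node cap are all illustrative assumptions, 
not the PR's actual logic), keeping the open-pipeline count between a 
cluster-size-scaled floor and a hard ceiling could look like:

```python
# Illustrative sketch only: the scaling rule and defaults are assumptions,
# not Ozone's actual pipeline-management code.
def bounded_pipeline_count(demand: int, num_nodes: int,
                           min_per_5_nodes: int = 1,
                           max_per_node: int = 2) -> int:
    minimum = max(1, (num_nodes // 5) * min_per_5_nodes)  # floor grows with cluster size
    maximum = num_nodes * max_per_node                    # hard upper bound
    return min(max(demand, minimum), maximum)

# A 100-node cluster keeps a larger floor than a 5-node cluster:
print(bounded_pipeline_count(0, 100))  # 20
print(bounded_pipeline_count(0, 5))    # 1
```

So even a burst of demand stays capped, and an idle cluster falls back to its 
minimum rather than zero.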
   
   > Sometimes, the main load may be from reads, or there is some rarely used 
EC policy (eg a tiny number of writes are using EC-3-2). That is why I think 
the "block allocation rate" is the best way to gauge the number of pipelines we 
need.
   
   Actually EC reads are often slower than replicated reads, so EC tends to 
serve large cold data where writes outnumber reads; hot data is often cached 
or converted to 2-way/3-way replication. And we tend to use only a few EC 
policies in a single cluster.
   BTW, from my POC test in the doc, you can see exactly from the graphs that 
as more block allocations come in, more new pipelines are opened, and when 
block allocations drop off, the number of pipelines goes back down to the 
minimum bound and stays there, as expected.
   
   > If we start with some sensible, but configurable minimum and some upper 
bound based on the registered nodes.
   
   Sure, the upper bound should be related to the number of registered nodes.
   But what is a "sensible" minimum in your mind? Are there contributing 
factors such as the number of nodes in the cluster, client bandwidth, or 
something else?
   
   > Then we keep track of the block allocation requests per time period in the 
ECWritableContainerProvider and per EC policy. We can guess the time it takes 
for a client to write a full block - it will always be approximate. We don't 
know how much of the block will be filled, or if the writer is a slow writer 
streaming events, or a fast writer. We know the max MB they can write, as it 
will be the `blockSize * Required_Nodes_For_EC_Policy`, for 6-3, that will be 
`256 * 9 = 2304MB`. The data is written mostly serially, so guess 150MB/s, it 
will take about 15 seconds to write that block. We can scale that number back by 
some factor as not all blocks will be filled. Eg assume it is 50% of that.
   > 
   
   Here I disagree that we should use an experience-based value like "150MB/s" 
for the estimation, because it depends largely on the hardware in use, e.g. a 
10GbE NIC greatly outperforms a 1GbE one, and there are even faster cards 
(25GbE, 40GbE). You could imagine that different disks contribute to 
throughput in a similar way, even though we don't tend to use SSDs for EC.
   Other factors, such as client concurrency, other co-located services 
sharing the client's resources, and complex network topologies involving 
switches and racks, all contribute to the IO speed.
   So we don't want Ozone users to have to do complex, hard-to-get-accurate 
calculations as estimations before they can deploy an Ozone cluster with EC, 
right?
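
   To make the hardware dependence concrete, here is the quoted arithmetic 
replayed for a few client throughputs (the 150MB/s figure comes from the quote 
above; the other throughputs are just assumed round numbers):

```python
# Replaying the quoted estimate: full block-group size for EC 6-3,
# divided by an assumed end-to-end client throughput.
BLOCK_SIZE_MB = 256
NODES_FOR_EC_6_3 = 9
full_group_mb = BLOCK_SIZE_MB * NODES_FOR_EC_6_3  # 2304 MB

for mb_per_s in (100, 150, 1000):  # ~1GbE, the guessed value, ~10GbE
    seconds = full_group_mb / mb_per_s
    print(f"{mb_per_s} MB/s -> {seconds:.1f} s")
# 150 MB/s gives ~15.4 s (the quoted estimate), while 1000 MB/s gives
# ~2.3 s, so a rate-based pipeline estimate would shift by ~6.7x
# depending purely on hardware.
```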
   
   > If we are seeing 10 block requests per second for an EC policy, and it 
takes 15 seconds to write the full block, perhaps we need 10 * 15 = 150 
pipelines, or we can scale that by `block_fill_factor`. If the load drops to 1 
request per second, we only need 10.
   > 
   > The other thing we need to consider, is that Ratis pipelines can have many 
open containers on a single pipeline, and each container is constrained to a 
single disk. An EC pipeline only has a single container and hence a single 
disk on a DN. So we need to consider the number of disks on the DNs as well as 
the number of nodes.
   
   Here I raised the point in the doc that we need to consider at least disks 
and network together for a single DN, because the performance of a single DN 
is largely bounded by these two factors at the same time, but Ozone doesn't 
collect info on NICs, and many storage systems don't either.
   On the Ozone SCM side, most placement policies only consider nodes and let 
the DN manage its disks by itself.
   So let's start with a configured limit for the max number of pipelines per 
datanode, as Ratis does; later we could introduce more calculation-based, 
reasonable values.
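
   A minimal sketch of that per-datanode limit idea (the function name and the 
numbers are hypothetical, not an existing Ozone config or API):

```python
# Hypothetical sketch: derive a cluster-wide EC pipeline cap from a
# per-datanode pipeline limit, analogous to how Ratis pipelines are capped.
def max_ec_pipelines(num_datanodes: int,
                     pipelines_per_dn_limit: int,
                     nodes_per_ec_pipeline: int) -> int:
    # Each EC pipeline occupies nodes_per_ec_pipeline distinct datanodes
    # (e.g. 5 for rs-3-2), so the total per-DN slots divide among pipelines.
    return (num_datanodes * pipelines_per_dn_limit) // nodes_per_ec_pipeline

print(max_ec_pipelines(100, 2, 5))  # 40
```

The per-DN limit stays a simple operator knob, and any smarter 
calculation-based value can replace it later without changing the cap's shape.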
   
   > I am not sure how fine grained we would need to track the request rate, eg 
per second, per 10 seconds, per minute. Or should we have something like the 
Linux top command were it has the 1, 5 and 15 minute average, and if we did 
have that, how would we use it?
   
   Yeah, exactly, there's hardly any rationale for tracking the request rate 
as a hint for resource allocation, right?
   Usually we keep request rates only as monitoring metrics, to help us 
understand the load and performance of the system.
   
   > I feel the existing close logic should handle containers filling OK 
without having to worry about it in the WriteableContainer provider. The DN 
triggers the close at some percentage full, expecting more blocks will continue 
to be written. For EC containers the problem is even less as the blocks are 
spread across the replicas more than with Ratis.
   
   I agree with this point, and I don't touch the close logic in this PR; the 
allocatedSpace is only a hint for pre-allocating new pipelines, and we don't 
force-close the pipeline/container.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


