guihecheng edited a comment on pull request #3254:
URL: https://github.com/apache/ozone/pull/3254#issuecomment-1085738709


   Thanks for your reply @sodonnel. Generally, this PR tracks the "block allocation rate" 
(or "load") via allocated space. That is not perfectly accurate, but it is 
enough to serve as a hint for creating new pipelines.
   This is the approach I came up with because it gives a simple, straightforward 
trigger for allocating new pipelines: "the container is 
potentially full".
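To make the trigger concrete, here is a minimal sketch of the idea (all names and the threshold are illustrative, not the actual SCM API from this PR):

```java
// Hypothetical sketch of the "container is potentially full" trigger:
// compare the space already handed out against the total size limit of
// the open containers, and ask for a new pipeline when headroom is low.
public class PipelineTriggerSketch {
    // Assumed 5 GB container size; the real value is configurable.
    static final long CONTAINER_SIZE = 5L * 1024 * 1024 * 1024;

    /**
     * Returns true when allocated space suggests the open containers
     * may soon be full, so a new pipeline should be created.
     */
    public static boolean needNewPipeline(long allocatedBytes,
                                          int openContainers,
                                          double threshold) {
        long capacity = (long) openContainers * CONTAINER_SIZE;
        return allocatedBytes >= (long) (capacity * threshold);
    }
}
```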
   
   > The formula I was suggesting is only a first guess. A lot of things can 
influence what the write speed will be, but the limiting factor is almost 
certainly the disk speed, rather than NIC speed. The NIC's will generally be 
much faster than the sustained disk speed.
   
   Here I disagree that NIC speed is not a limit. For EC in particular, the 
client's NIC speed is usually the bottleneck, since the client has to transfer data 
concurrently to many DNs (think of EC 10:4, not just 3:2).
   In the Ratis case, replication is done by the Ratis leader to the Ratis 
followers, so Ratis uses inter-DN network bandwidth, whereas EC writes consume purely the 
client's output bandwidth.
   In fact, I have hit the speed limit of a 10GbE NIC with an experimental 
Ozone deployment of only 30 simulated DNs on 3 physical servers, testing a load 
of 10 concurrent writes of 10 8 GB files.
   
   So if there is a more convincing formula or calculation method, please 
raise it for discussion.
   
   > We also cannot make broad assumptions such as "ec is only for cold data" 
or "ec will always be large files". People will do all sort of things we cannot 
expect. With modern hardware, it is reasonable on HDFS to have most data be EC 
provided the blocks are not small, and Ozone will likely be the same.
   
   Well, that observation merely led me to the idea of "tracking the 
block allocation rate via allocated space"; it does not mean the approach only 
works for "cold" or "large" files. It can serve small files under the existing 
bound on open pipelines.
   I mean, even with purely small files, we are bounded by the number of open 
pipelines.
   And small files get a small number of pre-allocated blocks when they are 
created (we pre-allocate at most 64 blocks for big files, and at least 1 block 
for small files), so there won't be many extra block-allocation requests or 
much extra load.
   I could run further tests on such scenarios with this approach.
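The pre-allocation bound above can be sketched like this (the 256 MB block size is an assumed default; the cap of 64 and floor of 1 are the bounds mentioned above):

```java
// Sketch of the pre-allocation bound: at most 64 blocks for big files,
// at least 1 block for small files, based on the requested size.
public class PreallocationSketch {
    static final long BLOCK_SIZE = 256L * 1024 * 1024; // assumed default
    static final int MAX_BLOCKS = 64;

    public static int blocksToPreallocate(long fileSizeBytes) {
        // Ceiling division: blocks needed to cover the file size.
        long needed = (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
        return (int) Math.min(MAX_BLOCKS, Math.max(1, needed));
    }
}
```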
   
   > The number of open pipelines really should be a function of the write 
load, and the only was I can see SCM monitor that is by the rate of block 
requests. If the write load is very small, it doesn't matter how many pipelines 
are open - even one will do. We ideally want the containers to close after 
reaching their size limit relatively quickly. We also don't want to end up with 
100 containers each with one block in them for a very slow write rate, as that 
could result in many small containers over time and currently we don't have a 
way to merge them.
   
   I can't agree more that we should keep the number of open containers under a 
reasonable limit, and that we don't want too many small containers. But I 
don't think we have to worry so much about that.
   
   Ozone and HDFS are long-running systems serving as backends for big-data processing; 
we have a large number of HDFS clusters busy with all kinds of loads in 
our production environment, all day round.
   So containers will eventually be closed even if all writes are small; they 
will finally reach their size limit unless they are force-closed by 
unpredictable events other than becoming full.
   
   I'll try to collect all the container-close code paths and investigate the 
cases that lead to small containers being closed early. As long as containers are 
open, they will keep serving allocation requests until they are full.
   
   > I guess the space tracking is one way to track the allocation rate, but 
have a suspicion simply tracking allocation rate would be more accurate, even 
though we cannot be sure how long it will take to write the given block.
   
   Yes, exactly: space tracking is just one form of allocation-rate tracking.
   If you can suggest a more accurate tracking approach, with an explanation of why 
it is better, I'm open to discussing it.
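For comparison, direct rate tracking could be as simple as counting allocation requests in a sliding time window; a minimal sketch (hypothetical names, not existing SCM code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window counter of block-allocation requests: an alternative
// to space-based tracking for estimating the allocation rate.
public class AllocationRateTracker {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis;

    public AllocationRateTracker(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public synchronized void recordAllocation(long nowMillis) {
        timestamps.addLast(nowMillis);
        evictExpired(nowMillis);
    }

    /** Allocation requests per second over the trailing window. */
    public synchronized double rate(long nowMillis) {
        evictExpired(nowMillis);
        return timestamps.size() * 1000.0 / windowMillis;
    }

    private void evictExpired(long nowMillis) {
        while (!timestamps.isEmpty()
                && timestamps.peekFirst() < nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }
}
```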
   
   Finally, believe me, how long it takes to write a given block is not a 
question anyone can easily answer.
   We are constantly doing performance optimizations and code-path refactoring, 
right? You might get a bounded estimate today, but it is not 
guaranteed to hold tomorrow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
