guihecheng edited a comment on pull request #3254: URL: https://github.com/apache/ozone/pull/3254#issuecomment-1085738709
Thanks for your reply @sodonnel. Generally I'd say that this PR uses one way of tracking the "block allocation rate" or "load": the allocated space. It is not perfectly accurate, but it is accurate enough to serve as a hint for new pipelines. It is also a way I could think of that gives a simple and straightforward definition of the trigger for allocating new pipelines: "the container is potentially full".

> The formula I was suggesting is only a first guess. A lot of things can influence what the write speed will be, but the limiting factor is almost certainly the disk speed, rather than NIC speed. The NICs will generally be much faster than the sustained disk speed.

Here I disagree that the NIC speed is not a limit. Especially for EC, the client NIC speed is usually the bottleneck, since a client has to transfer data concurrently to many DNs (think of EC 10:4, not just 3:2). In the Ratis case, replication is done by the Ratis leader to the Ratis followers, so Ratis uses inter-DN network bandwidth, but EC writes consume purely the client's output bandwidth. In fact, I have already hit the NIC speed limit of a 10GE NIC on an experimental Ozone deployment of only 30 simulated DNs across 3 physical servers, testing a load of 10 concurrent writes of 8GB files. So if there is a more convincing formula or calculation method, please raise it for discussion.

> We also cannot make broad assumptions such as "ec is only for cold data" or "ec will always be large files". People will do all sorts of things we cannot expect. With modern hardware, it is reasonable on HDFS to have most data be EC provided the blocks are not small, and Ozone will likely be the same.

Well, this is only an observation that led me to the idea of "tracking the block allocation rate via allocated space"; it doesn't mean the approach only works for "cold" or "large" files. It can serve small files as well, given the existing bound on the number of open pipelines.
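To make the EC-vs-Ratis client bandwidth argument above concrete, here is a back-of-the-envelope sketch. The function name, the per-writer rate, and the 14/10 stripe overhead for RS-10-4 are illustrative assumptions, not code or numbers from the PR:

```python
# Rough estimate of client-side NIC egress for EC vs Ratis writes.
# Assumption: with EC RS-10-4 the client ships data + parity (14/10 of the
# logical bytes) to the DNs itself; with Ratis 3-way replication the client
# sends a single copy and the leader replicates to followers over
# inter-DN links, so client egress is ~1x.

def client_egress_gbps(write_rate_gbps: float, scheme: str) -> float:
    """Client NIC bandwidth needed for a given logical write rate."""
    if scheme == "ec-10-4":
        return write_rate_gbps * 14 / 10  # data + parity leave the client
    if scheme == "ratis-3":
        return write_rate_gbps            # one copy; replication is DN-to-DN
    raise ValueError(f"unknown scheme: {scheme}")

per_writer = 0.8  # Gbps per writer (assumed figure, for illustration only)
total_ec = 10 * client_egress_gbps(per_writer, "ec-10-4")
total_ratis = 10 * client_egress_gbps(per_writer, "ratis-3")
print(total_ec, total_ratis)
```

Even at a modest assumed 0.8 Gbps per writer, ten concurrent EC writers from one host push client egress past 11 Gbps, which is consistent with saturating a 10GE NIC while the same load under Ratis stays at 8 Gbps.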
I mean even if there are only small files, we are bounded by the number of open pipelines. And small files get a small number of pre-allocated blocks when they are created (we pre-allocate at most 64 blocks for big files, and at least 1 block for small files), so there won't be many more "block allocation requests" or much more "load". I could run further tests on such scenarios with this approach.

> The number of open pipelines really should be a function of the write load, and the only way I can see SCM monitoring that is by the rate of block requests. If the write load is very small, it doesn't matter how many pipelines are open - even one will do. We ideally want the containers to close after reaching their size limit relatively quickly. We also don't want to end up with 100 containers each with one block in them for a very slow write rate, as that could result in many small containers over time and currently we don't have a way to merge them.

I couldn't agree more that we should keep the number of open containers under a reasonable limit, and that we don't want too many small containers. But I don't think we have to worry so much about that. Ozone and HDFS are long-running systems serving as backends for big-data processing; we have a large number of HDFS clusters busy running all kinds of loads in our production environment around the clock. So containers will be closed eventually even if all writes are small: they will reach their size limit in the end unless they are force-closed by unpredictable events other than becoming full. I'll try to collect all the container-close code paths and investigate the cases that lead to small containers being closed early. As long as containers are open, they will keep serving allocation requests until they are full.
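The pre-allocation bound described above (at most 64 blocks for big files, at least 1 for small files) can be sketched as follows. This is an illustrative model, not the actual Ozone code; the constant name and the 256 MB default block size used in the example are assumptions:

```python
# Sketch of the pre-allocation bound discussed above: the number of blocks
# pre-allocated at file creation is capped at 64 and floored at 1, so small
# files generate only a single block allocation request up front.
import math

MAX_PREALLOCATED_BLOCKS = 64  # assumed constant name, cap from the comment

def blocks_to_preallocate(requested_size: int, block_size: int) -> int:
    """Blocks to reserve for a file of requested_size bytes."""
    needed = max(1, math.ceil(requested_size / block_size))
    return min(MAX_PREALLOCATED_BLOCKS, needed)

BLOCK = 256 * 1024 * 1024  # 256 MB block size (assumed default)
print(blocks_to_preallocate(4096, BLOCK))     # a 4 KB file: 1 block
print(blocks_to_preallocate(1 << 40, BLOCK))  # a 1 TB file: capped at 64
```

This is why a pure small-file workload does not multiply the allocation load: each small file contributes exactly one pre-allocated block.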
> I guess the space tracking is one way to track the allocation rate, but have a suspicion simply tracking allocation rate would be more accurate, even though we cannot be sure how long it will take to write the given block.

Yes, exactly: space tracking is just one form of allocation rate tracking. If you are suggesting other, more accurate tracking approaches with concrete justifications, I'm open to discussing them. Finally, believe me, how long it takes to write a given block is not a question anyone can easily answer. We do all kinds of performance optimizations and code-path refactorings from time to time, right? You may get a bounded estimate today, but it is not guaranteed to hold tomorrow.
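A minimal sketch of the "allocated space as allocation-rate proxy" idea being discussed. All class and function names, the capacity, and the 90% threshold are assumptions for illustration, not the PR's actual design:

```python
# Sketch: track allocated (not yet necessarily written) space per open
# container, and use "every open container is nearly full" as the hint to
# allocate a new pipeline -- the "container is potentially full" trigger.

class OpenContainer:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.allocated = 0  # bytes handed out to clients, written or not

    def allocate_block(self, size: int) -> bool:
        """Reserve space for a block; refuse once the container is full."""
        if self.allocated + size > self.capacity:
            return False
        self.allocated += size
        return True

def needs_new_pipeline(containers, threshold: float = 0.9) -> bool:
    """Hint to create a pipeline when all open containers are nearly full."""
    return all(c.allocated >= threshold * c.capacity for c in containers)

c = OpenContainer(capacity_bytes=5 * 1024**3)  # 5 GB container (assumed)
c.allocate_block(256 * 1024**2)                # reserve one 256 MB block
print(c.allocated, needs_new_pipeline([c]))
```

The point of the sketch is that the trigger depends only on bookkeeping SCM already does at allocation time, with no estimate of how long any individual block takes to write.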
