On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > > It seems... surprising that the additional I/O operations are actually
> > > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> > > into why this is happening, and whether there is anything that can be
> > > optimized below the file system?
> >
> > I can't tell the exact size tho, roughly it's between 1GB and
> > 4GB. And, per lots of test results with various tunings, it turned
> > out memory allocation speed was the culprit. If we use 4KB pages, we
> > couldn't get the full bandwidth unless we set the biggest core
> > running at the highest frequency.
>
> OK, if we assume that the model file that you want to load is 2GB
> then the number of 4k pages that you need is a bit over half a million
> (524288).  So if it takes 1 second without large folios (2 GB/s as you
> stated above), and half a second with (4 GB/s), then you're basically
> saying that it was costing you half a second to allocate 524288
> singleton pages.  And the whole point of this exercise is to save that
> half second?
>
> And I assume that these timings were using performance cores, and part
> of the goal here is to be able to use an efficiency core instead.
>
> Did I get that right?

Yes, right.

> > > But the problem with using small folios is that if you want to
> > > actually *use* the memory, unless you want to segment out the memory
> > > so it can't be used for anything other than the AI models (e.g., by
> > > using something like hugetlbfs) it's just going to break up the memory
> > > into smaller folios.  So that's not actually going to *help* in actual
> > > real life use cases.  It might help for your artificial benchmarks /
> > > experiments, but in the real life case where Android applications are
> > > running and fragmenting all of the device memory, the large folios
> > > won't be available *anyway*.
> >
> > Agreed it's hard to get this done perfectly tho, as the best effort on this
> > particular AI model case, I focused on two timings when loading the models:
> > 1) right after device boot, 2) dynamic loading when required. To secure high
> > order pages, for 1), I disabled the large folios consumed by EROFS, while for
> > 2), I tried to call compact_memory before loading the model. In both cases,
> > I could observe we could get a fair amount of large folios. Yes, not 100% tho.
>
> If (1) is a common case in real life, the thing to do would be to grab
> 2GB of large folios early in the startup sequence, and then let
> erofs do its thing --- and then at the end of the startup, right before you
> load the model, you can release the 2GB worth of large folios.
>
> (That being said, I'm guessing #1 is actually not that interesting,
> since as a percentage of the time that it takes for an Android device
> to start up, is adding an extra half-second *really* going to be
> noticeable by the user?)
>
> But for case #2, that's the much more challenging case.  If you don't
> call compact_memory() you're going to burn half a second to allocate
> the 4k pages, since the large folios won't be available.
> But if you *do* call compact_memory() in a production ROM, depending on
> how fragmented the memory is and how much memory you have, calling
> compact_memory() could take **minutes**.  So what's the point?
>
> The bottom line is if it's right after device boot, there are simple
> techniques that don't require hacking up f2fs.  But in the
> demand-loaded case, calling compact_memory() is the last thing you'll
> want to do.  You're better off either asking the mm to allocate the 4k
> pages, or to do whatever compaction it can to just free up 2GB worth
> of folios.  (Calling compact_memory() is overkill, and only makes
> sense in the context of a benchmark / proof of concept demo.)
>
> Either way, trying to get file systems to avoid using large folios in
> the hopes that this will speed up large AI model loading.... doesn't
> seem to make sense.
>
> If the problem is fundamentally about making 2GB worth of large folios
> available in a way that takes significantly less time than just
> allocating the model using half a million 4k pages, that's the question
> that we should be asking Matthew and the mm folks.  Which is why it
> was too bad we didn't raise this issue at LSF/MM earlier this month.

Thanks for the context. To clarify a piece I missed earlier: the model pages
are also utilized for inference. Our data shows that larger chunks yield higher
inference speeds. Consequently, I required high-order pages to optimize both
read throughput and inference latency. I will halt my current efforts and wait
for alternative suggestions.

>
> > Indeed, I was off from LSF/MM for years due to various product issues, not
> > related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> > if I can get the budget from company.
>
> Next time, as a suggestion, feel free to raise the issue when the
> LSF/MM CFP goes out, even if you don't think it's likely you will get
> an invite.  Indeed, with a sufficiently interesting topic, that's the
> way to *get* an invitation.
> It will require breaking down the
> technical requirements as you and I have done for the last few messages on
> this thread.
>
> Even if you can't attend LSF/MM due to time or budget reasons, there
> are a number of your colleagues who are attending, who could raise the
> question on your behalf.  I've been known to do that once or twice on
> behalf of other Google teams.  But it does require that you approach
> the usual LSF/MM suspects a good 2-3 months before the conference so
> we can help you craft an appropriate response to the CFP.

Thanks for the suggestion. Will definitely do.

>
> Cheers,
>
> 					- Ted
>
>
> _______________________________________________
> Linux-f2fs-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
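
For reference, the estimate quoted earlier in this message (a 2GB model,
4k pages, 2 GB/s with 4KB pages versus 4 GB/s at full bandwidth) can be
checked with a few lines of arithmetic. This is only a rough sketch: it
assumes "2GB" means 2 GiB and that the whole gap between the two rates is
page-allocation cost, which the thread suggests but does not prove. The
helper names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the thread.
# Assumptions (not stated exactly in the thread): "2GB" is 2 GiB,
# a "4k page" is 4096 bytes, and GB/s means GiB/s.

GIB = 1024 ** 3
MODEL_BYTES = 2 * GIB
PAGE_BYTES = 4 * 1024


def page_count(total_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to hold total_bytes, rounded up."""
    return -(-total_bytes // page_bytes)


def load_seconds(total_bytes: int, gib_per_second: float) -> float:
    """Time to read total_bytes at a sustained rate in GiB/s."""
    return total_bytes / (gib_per_second * GIB)


pages = page_count(MODEL_BYTES, PAGE_BYTES)
slow = load_seconds(MODEL_BYTES, 2.0)  # rate observed with 4KB pages
fast = load_seconds(MODEL_BYTES, 4.0)  # full device bandwidth
extra = slow - fast                    # time attributed to allocation
per_page_ns = extra / pages * 1e9      # implied cost per 4k page

print(f"pages={pages} slow={slow}s fast={fast}s "
      f"extra={extra}s per_page={per_page_ns:.0f}ns")
```

Under these assumptions the half-second difference works out to a bit
under a microsecond per 4k page, which is the cost the large folios are
meant to avoid.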

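
On the question of whether enough large folios are available before a
demand load: one way to look at it without triggering compaction is to
read /proc/buddyinfo, which lists the number of free blocks per order for
each zone. The sketch below sums the free memory at or above a chosen
order. The function name, the 4 KiB base page size, and the sample text
are assumptions for illustration only; real input would come from reading
/proc/buddyinfo on the device.

```python
# Sum free memory held in buddy-allocator blocks of at least a given order.
# Assumption: 4 KiB base pages. The sample text below is invented for
# illustration; on a device the text would be read from /proc/buddyinfo.

PAGE_BYTES = 4096


def free_bytes_at_or_above(buddyinfo_text: str, min_order: int) -> int:
    """Return bytes free in blocks of order >= min_order across all zones."""
    total = 0
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        counts = [int(x) for x in fields[4:]]
        for order, count in enumerate(counts):
            if order >= min_order:
                total += count * (PAGE_BYTES << order)
    return total


SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      0      0      0      0      0      0      0
Node 0, zone   Normal    100     50     20     10      4      2      1      0      0      1      0
"""

# Free memory available as 64 KiB (order-4) or larger blocks in the sample.
print(free_bytes_at_or_above(SAMPLE, 4))
```

Comparing that figure with the model size gives a cheap hint about
whether a load is likely to fall back to 4k pages. It is only a snapshot
and says nothing about how long compaction would take.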