PATCHES - Countdown to March 1
Another month fading into the past, and here is the current countdown
list. The next countdown will be on March 1st.

A list of all merge requests can be found here:
https://gitlab.com/lilypond/lilypond/-/merge_requests?sort=label_priority

Push:

No patches in Push at this time.

Countdown:

!1229 Trade _ for G_ in all Scheme code - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1229

!1228 Do not notate \repeat ... 1 - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1228

!1217 Add -ddeterministic option, use with lysdoc-test - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1217

Review:

!1237 Doc: add note that MIDI volume cannot be changed mid-note - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1237

!1235 Draft: Lyric repeat count - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1235

!1233 output-distance: refactor eps_to_png - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1233

!1232 regression: split off a \book from paper-size-custom.ly - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1232

!1230 Basic Guile 3 compatibility - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1230

New:

No patches in New at this time.

Waiting:

!1231 Separate help for internal vs external -d options. - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1231

!1136 \break implies \bar "" when needed - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1136

!1103 output-distance: use ImageMagick's compare for visual comparisons - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1103

!913 release: binaries with cairo - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/913

Cheers,
Colin
Re: Blockers for Guile 2.2
On Sun, Feb 27, 2022 at 12:13 PM Han-Wen Nienhuys wrote:
> On Sun, Feb 27, 2022 at 10:39 AM Luca Fascione wrote:
> > is it true that if you double the source size you double the
> > compilation time?
>
> it should be, but we have rather complicated page breaking code that
> is so hairy that nobody understands it fully. I'm not sure there is
> NP-complete snake in the proverbial grass there.

Understood. As a user of both lilypond and LaTeX I have been idly
wondering for years whether modern computers can afford to use global
line/page breaking algorithms that would avoid some of the shortcomings
of TeX's approach. A discussion for a different thread, of course.

> accessing data (e.g. offsets):
> * on first access, the callback gets executed. This is just evaluating
> ( ).
> * on second access, the value is cached in an alist, and looking it up
> is extremely cheap.

This is cool. :-) I don't know enough about this program to even begin
to have a gut feeling, however I'm thinking there would be tons of
these reads, and I'm hearing you say that in an eventual sense, all
data access is an alist access.

I don't know how alists are actually implemented under the hood, but
they feel like they would be a linear scan with symbol-typed keys on
the surface. So to pull out one float you're doing what, 5-10 64-bit
compares (checking the keys) and just as many pointer jumps, right?
(I'm thinking the alist is a list of symbol/value pairs in the
implementation also.) This cost strongly dominates the float
dereference itself, and there is this question of how much extra stuff
is happening around it (I guess in my mind I'm comparing it to element
access in C++, which is one pointer (this), one offset for the member
(compiled into an immediate), and one load of this+offset (which the
hardware helps you with)).

I feel for the moment I can't provide any concrete insight into any of
this, because I don't know the specifics well enough.
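[For concreteness, here is a small sketch of the cost model being guessed
at above. This is my own illustration, not LilyPond or Guile source; the
names are invented. An alist modeled as a chain of cons-like cells, looked
up by a linear scan of key compares:]

```cpp
#include <cassert>

// Illustrative sketch (not LilyPond/Guile code): an alist as a chain of
// cells holding symbol/value pairs.  With interned symbols, each probe is
// one pointer compare (the key) plus one pointer chase (the "cdr"), so a
// hit at depth k costs roughly k compares and k jumps, as guessed above.
using Symbol = const char *; // interned: pointer equality is enough

struct AlistNode
{
  Symbol key;
  double value;
  AlistNode *next; // the "cdr" of the list
};

// Linear scan; returns nullptr when the key is absent.
inline const double *
alist_get (const AlistNode *alist, Symbol key)
{
  for (; alist; alist = alist->next)
    if (alist->key == key)
      return &alist->value;
  return nullptr;
}
```

[Compare this with the C++ member access mentioned above, which compiles
down to a single load at a constant offset from `this`.]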
> > Code following a simple pattern like this, once compiled, will
> > largely be dominated by the scripting language runtime overhead
>
> From the outside this may seem plausible, but I doubt your intuition
> here:
>
> * the callback tables are also alists. They are cheap (they aren't
> sped up if you swap them for hash tables)

Not presuming to know your program better than you, but I'd just bring
up that this is saying that your lists are short (likely length 20-ish
on average): the observation you report is that hashing so you can do a
direct access into an array is not faster than several pointer-pointer
comparisons and pointer chases. The hash you'd use here would be
something like FNV or so; it'll break even somewhere in the 10-20
comparisons, I'd expect.

> * Scheme has no marshaling: objects are either direct (scheme -> C++
> is bit twiddling), or they are indirect (a tagged pair with a C++
> pointer in the cdr)

Half of that I expected (more specifically, for various reasons, a
number of which were not accurate, I expected the Scheme APIs to be
similar to the TCL APIs, and there as well you just get handed straight
what TCL has in hand, no marshalling involved). One thing I didn't know
is that the client calls to extract the machine representation of the
value would be super cheap.

But still, if the guile compiler translates scheme values into native
ones and is able to leave them there for "long" stretches of code in
some cases, and our use case instead prevents that, it seems it could
eventually add up. Again, I do need to learn the source better before
you give these thoughts any real weight.

> IMO the real problem is that we don't have good tooling to see what is
> going on in the Scheme side of things: C++ has decent perf analysis,
> but the Guile side of things just looks like a lot of time spent in
> scm_eval(). Some of it is overhead, some of it might be a poorly
> implemented Scheme function (which is often easier to fix than
> reducing overhead.)
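[To make the break-even guess above concrete, a sketch of 64-bit FNV-1a,
the kind of cheap string hash mentioned; this is my own illustration, not
code from any of the projects discussed:]

```cpp
#include <cstdint>

// Sketch of 64-bit FNV-1a.  Hashing spends one XOR and one multiply per
// input byte before the table is even indexed, so for symbol-like keys of
// around ten characters, a 10-20 entry linear scan of single pointer
// compares is competitive -- the break-even argument made above.
inline uint64_t
fnv1a_64 (const char *key)
{
  uint64_t h = 14695981039346656037ull; // FNV-1a 64-bit offset basis
  for (; *key; ++key)
    {
      h ^= static_cast<unsigned char> (*key);
      h *= 1099511628211ull; // FNV-1a 64-bit prime
    }
  return h;
}
```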
Very much agreed that poorly conceived code is the first thing to
address, no doubt.

I'd think that the way to gain insight as to what's going on is to
inspect the bytecode, actually, and gain familiarity with the APIs that
execute it. Is it that the bytecode is then translated to executable
code, or is it running on a VM? I would assume they don't provide a
decompiler of any sort, do they?

Thanks for a most interesting discussion,
L
Re: Blockers for Guile 2.2
On Sat, Feb 26, 2022 at 2:02 PM Jonas Hahnfeld wrote:
> > > The same happens for C++ files, you also have to recompile. But
> > > it's true that editing scm files isn't for free anymore.
> >
> > The Scheme compilation felt much slower, and for C++ ccache takes
> > away a lot of the pain of recompiles. It also appears to be
> > single-threaded? I admit not having timed it in detail.
>
> Yes, GUILE_AUTO_COMPILE=1 is single-threaded, which is why it will
> never be the default.

If we precompile as part of the build, I wonder if we could use a hack
like -djob-count to split lilypond into N jobs where each one does a
part of the job.

https://www.gnu.org/software/guile/manual/html_node/Compilation.html
suggests that one can call the compiler explicitly.

--
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen
Re: Blockers for Guile 2.2
On Sun, Feb 27, 2022 at 10:39 AM Luca Fascione wrote:
> In other words, is it approximately true that "for (almost) any
> real-life score the total compilation time is proportional to the
> number of NoteHeads, past a certain size"?
> I'm guessing you need a few pages worth of material to kill away
> constant overheads, but beyond that is it true that if you double the
> source size you double the compilation time?

it should be, but we have rather complicated page breaking code that
is so hairy that nobody understands it fully. I'm not sure there is
NP-complete snake in the proverbial grass there.

> > I am not sure what you mean by effort spent
> > in "preparing" callbacks, could you elaborate?
>
> Imagine the grob is a C++ object that remains opaque to Scheme. So
> it's a thing for which in Scheme you move around a blind pointer, but
> to read property Y-offset you'd call (grob-get 'Y-offset) and
> grob-get is C++.
> ..

accessing data (e.g. offsets):

* on first access, the callback gets executed. This is just evaluating
( ).
* on second access, the value is cached in an alist, and looking it up
is extremely cheap.

> Code following a simple pattern like this, once compiled, will largely
> be dominated by the scripting language runtime overhead

From the outside this may seem plausible, but I doubt your intuition here:

* the callback tables are also alists. They are cheap (they aren't
sped up if you swap them for hash tables)

* Scheme has no marshaling: objects are either direct (scheme -> C++
is bit twiddling), or they are indirect (a tagged pair with a C++
pointer in the cdr)

IMO the real problem is that we don't have good tooling to see what is
going on in the Scheme side of things: C++ has decent perf analysis,
but the Guile side of things just looks like a lot of time spent in
scm_eval(). Some of it is overhead, some of it might be a poorly
implemented Scheme function (which is often easier to fix than
reducing overhead.)
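[The two-step access described above -- run the callback once, then serve
later reads from an alist cache -- can be sketched like this. The names
and shape are my own invention for illustration, not LilyPond's real API:]

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of lazy, cached property access: the first read of
// a property evaluates its callback and stores the result in an
// alist-like cache; every later read is only the cheap lookup.
using Symbol = const char *; // interned: pointer equality is enough

struct Grob
{
  std::vector<std::pair<Symbol, std::function<double ()>>> callbacks;
  std::vector<std::pair<Symbol, double>> cache; // filled lazily

  double get (Symbol key)
  {
    for (auto &c : cache) // later accesses: just the lookup
      if (c.first == key)
        return c.second;
    for (auto &cb : callbacks) // first access: evaluate the callback
      if (cb.first == key)
        {
          double v = cb.second ();
          cache.emplace_back (key, v);
          return v;
        }
    return 0.0; // property unset
  }
};
```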
> (*) 'boxing' is a C# word to mean wrap a POD value in an object to
> move it around in a way that is uniform with the other objects in the
> language. C# people need to keep boxing and unboxing costs in their
> head in code with lots of PODs being moved around, because that cost
> can dominate the execution of their code. I'm not sure what word is
> used in Guile for this concept.

For integers, floats and booleans, the CPU native representation
involves some bit operations. Intermediate conversions could be
short-cut if you have full type information, and have a sequence of
those in a row.

--
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen
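[A sketch of what an immediate ("direct") representation and its bit
operations can look like for small integers. The tag layout here is
illustrative only, not necessarily Guile's actual encoding:]

```cpp
#include <cstdint>

// Illustrative sketch of an immediate integer representation: the value
// lives in the upper bits of the machine word with a tag in the low bits,
// so converting between the scripting-language value and a native int is
// bit twiddling, never a heap allocation ("box").
using scm_word = intptr_t;

constexpr scm_word FIXNUM_TAG = 0x2; // low two bits '10' mark a fixnum

inline scm_word
make_fixnum (intptr_t n)
{
  return (n << 2) | FIXNUM_TAG; // "boxing" is two bit operations
}

inline bool
is_fixnum (scm_word w)
{
  return (w & 0x3) == FIXNUM_TAG;
}

inline intptr_t
fixnum_value (scm_word w)
{
  return w >> 2; // shift recovers the value
}
```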
Re: Blockers for Guile 2.2
On Sun, Feb 27, 2022 at 12:31 AM Jean Abou Samra wrote:
> > I never said I don't want to fix Guile 2.2 bugs, and you should know
> > as I spent lots and lots of time debugging #6218. I also said I
> > support moving CI to 2.2, so any MR would pass against 2.2.
> >
> > I am just asking to not drop 1.8 support.
> >
> > Most of the work we do isn't actually about Guile anyway,
> >
> > $ git log --since 2022/01/01 | grep ^commit | wc
> >     257     514   12336
> > $ git log --grep=[Gg]uile --since 2022/01/01 | grep ^commit | wc
> >       7      14     336
>
> I agree that the version-specific code we have (cond-expand
> and GUILEV2) isn't all that much. On the other hand, I would
> be glad to be able to use Guile 2 features, such as Unicode
> support, when and unless, (srfi srfi-71) (let and let*
> accepting multiple values), S-expression comments,
> scm_c_bind_keyword_arguments, and a few others. Now
> that my principal concern with the sustainability of
> Guile 2 binaries (shipping bytecode with them) is cleared,
> I have mixed feelings about when to leave Guile 1 behind.

You say you have mixed feelings, but I think (with your updates to
compilation), those feelings are ever less mixed?

> In my previous post I showed that compiling all Scheme
> files can be done in 20s with Guile 2 (so a few seconds
> for a single file), and 4s with Guile 3 (so near instant
> for one file). Would that address your concern with
> compilation speed?

Thanks a ton for your investigation into this. This is a game changer:

MSDM-reduced:
1.8: real 0m14.788s
2.2: real 0m14.648s

les-nereides:
1.8: real 0m1.376s
2.2: real 0m1.224s

Let's kill 1.8 support.

--
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen
Re: Blockers for Guile 2.2
On Sat, Feb 26, 2022 at 10:48 PM Jean Abou Samra wrote:
> > [Jonas]
> > He, I always thought auto-compilation didn't optimize! Now don't
> > tell me Guile also applies optimizations while just reading and
> > supposedly interpreting code...
>
> I don't think it does. At least, you don't usually call eval or
> primitive-eval on code to run it repeatedly, just for one-shot
> code, so I don't see the sense it would make. Also, it seems
> to me that the Guile developers are much more focused on compilation.

Randomly, it seems to me my own problem with fingering might possibly
bring up a counter-example pattern (for eval vs. compile): when you
attach a callback to something that is evaluated often (in the case of
fingering there is an after-line-break callback), which I am guessing
runs O(count(Fingering)) ~= O(count(NoteHead)) times. Isn't this a case
of scheme code coming in from the .ly file (likely an included ly file)
and being run O(N) times?

I can't say I understand lilypond super well, but it doesn't feel like
it would often employ algorithms that are superlinear in the NoteHead
count (it seems most activities would be organized as linear scans), so
I'm guessing O(N) is as bad as it gets for a cursory evaluation?

In other words, is it approximately true that "for (almost) any
real-life score the total compilation time is proportional to the
number of NoteHeads, past a certain size"? I'm guessing you need a few
pages worth of material to kill away constant overheads, but beyond
that, is it true that if you double the source size you double the
compilation time?

The background for this was what David was saying, whether code in .ly
files would be optimized or not. At a guess, I'd think stuff that is
big and runs O(n) times might make some sense to see if we need to
optimize. I am sensitive to your other passage about the nightmare it
will be to keep this stuff around and properly invalidate it upon
changes.
> > [ skipping over the part regarding Guile 3, since I don't think
> > it's relevant here ]
>
> Perhaps I should have changed the title, but I do think it
> is relevant -- it gives hope for a state where development
> cycles will be easier. When we want Guile 3 is another question.
> I'm in favor of not making the move to Guile 2 wait more
> than it already has, but I wonder if at some point in this release
> cycle (by which I mean before the next stable release) we will want
> to switch to Guile 3.

fwiw, this sounds super reasonable to me.

> I was just trying to get orders of magnitude (for Guile 3 it's 1m30
> vs. 4s, no need for precise benchmarks to see which is faster :-).

Indeed. I was more thinking that not only are the opt/no-opt numbers
close, but also 18-ish and 19-ish seconds are close; it is possible
that difference too is spurious for some reason. (I guess I'm saying:
there's a possibility you have been a little lucky or a little unlucky,
and your actual runtime difference is closer to 2% or maybe 15%, but
you happened to take samples at "weird" times.)

I might have some time next week (after 7 March) to run these tests
several times, depending on some stuff I have going on for next
weekend. I'll contact you oob if I do, for some quick guidance
exchange.

> I am no performance expert, but LilyPond's performance-critical parts
> are written in C++, so in general I would think the cost of Scheme
> code is spread a bit all over the code base (unlike C++ where you can
> find some rather critical parts taking lots of time, like page
> breaking, or skyline algorithms).

Well, but this would be single calls from scheme into C++ that take "a
long time", do I read you right? Instead I was thinking more of the
"dying from a thousand cuts" kind of scenario. [continue below]

> I am not sure what you mean by effort spent
> in "preparing" callbacks, could you elaborate?

Imagine the grob is a C++ object that remains opaque to Scheme.
So it's a thing for which in Scheme you move around a blind pointer,
but to read property Y-offset you'd call (grob-get 'Y-offset), and
grob-get is C++. And then in C++ you'd just have a one-liner accessor
that is conceptually

    float scheme_grob_get (Grob *grob, sch_value *propname)
    {
      // typecheck propname to be a symbol
      // find out if you have a symbol and what accessor you want to
      // delegate to
      return grob->get_y_offset (); // actual delegation
    }

and in turn the getter is something like

    inline float get_y_offset () const
    {
      return y_offset; /* this is a float member, say */
    }

Code following a simple pattern like this, once compiled, will largely
be dominated by the scripting language runtime overhead: traversing the
callback table, finding the function pointer for the dispatch,
marshalling the values from the scheme representation to something you
can mess with in C++, and the way back upon return. I've read enough of
your C++ to see that a lot of this happens in client code, but either
way it's all code that dominates in cost the execution of the accessor.