PATCHES - Countdown to March 1

2022-02-27 Thread Colin Campbell

Another month fading into the past, and here is the current countdown list.

The next countdown will be on March 1st.

A list of all merge requests can be found here:
https://gitlab.com/lilypond/lilypond/-/merge_requests?sort=label_priority


 Push:

No patches in Push at this time.


 Countdown:

!1229 Trade _ for G_ in all Scheme code - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1229

!1228 Do not notate \repeat ... 1 - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1228

!1217 Add -ddeterministic option, use with lysdoc-test - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1217


 Review:

!1237 Doc: add note that MIDI volume cannot be changed mid-note - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1237

!1235 Draft: Lyric repeat count - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1235

!1233 output-distance: refactor eps_to_png - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1233

!1232 regression: split off a \book from paper-size-custom.ly - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1232

!1230 Basic Guile 3 compatibility - Jean Abou Samra
https://gitlab.com/lilypond/lilypond/-/merge_requests/1230


 New:

No patches in New at this time.


 Waiting:

!1231 Separate help for internal vs external -d options. - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1231

!1136 \break implies \bar "" when needed - Dan Eble
https://gitlab.com/lilypond/lilypond/-/merge_requests/1136

!1103 output-distance: use ImageMagick's compare for visual comparisons - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/1103

!913 release: binaries with cairo - Han-Wen Nienhuys
https://gitlab.com/lilypond/lilypond/-/merge_requests/913

Cheers,

Colin




Re: Blockers for Guile 2.2

2022-02-27 Thread Luca Fascione
On Sun, Feb 27, 2022 at 12:13 PM Han-Wen Nienhuys  wrote:

> On Sun, Feb 27, 2022 at 10:39 AM Luca Fascione wrote:
> > is it true that if you double the source size you double the compilation time?
>
> it should be, but we have rather complicated page breaking code that
> is so hairy that nobody understands it fully. I'm not sure there is an
> NP-complete snake in the proverbial grass there.
>

Understood. As a user of both lilypond and LaTeX I have been idly
wondering for years whether modern computers can afford to use global
line/page breaking algorithms that would avoid some of the
shortcomings of TeX's approach.
A discussion for a different thread, of course.

> accessing data (e.g. offsets):
> * on first access, the callback gets executed. This is just evaluating
> (<callback> <grob>).
> * on second access, the value is cached in an alist, and looking it up
> is extremely cheap.
>

This is cool. :-)
I don't know enough about this program to even begin to have a gut
feeling, but I imagine there would be tons of these reads, and I hear
you saying that, in an eventual sense, all data access is an alist
access.
I don't know how alists are actually implemented under the hood, but
on the surface they feel like they would be a linear scan with
symbol-typed keys. So to pull out one float you're doing, what, 5-10
64-bit compares (checking the keys) and just as many pointer jumps,
right? (I'm thinking the alist is a list of symbol/value pairs in the
implementation also.)
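
In toy form, the mental model I have is something like this (purely
illustrative Scheme; I don't claim it matches Guile's internals):

  ;; An alist as a list of (symbol . value) pairs, looked up by a
  ;; linear scan that compares keys until one matches.
  (define props '((X-offset . 0.0) (Y-offset . 1.5) (stencil . #f)))
  (define (my-assq key alist)
    (cond ((null? alist) #f)
          ((eq? key (caar alist)) (car alist)) ; symbols compare by pointer
          (else (my-assq key (cdr alist)))))
  (cdr (my-assq 'Y-offset props)) ; => 1.5, after two key comparisons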

This cost strongly dominates the float dereference itself, and there is
this question of how much extra stuff
is happening around it (I guess in my mind I'm comparing it to element
access in C++ which is one pointer (this),
one offset for the member (compiled into an immediate), one load of
this+offset (which the hardware helps you with)).

I feel for the moment I can't provide any concrete insight into any of
this, because I don't know the specifics enough.

> > Code following a simple pattern like this, once compiled, will largely be
> > dominated by the scripting language runtime overhead
>
> From the outside this may seem plausible, but I doubt your intuition here:
>
> * the callback tables are also alists. They are cheap (they aren't
> sped up if you swap them for hash tables)
>

Not presuming to know your program better than you, but I'd just bring
up that this is saying your lists are short (likely length 20-ish on
average): the observation you report is that hashing, so you can do a
direct access into an array, is not faster than several
pointer-pointer comparisons and pointer chases. The hash you'd use
here would be something like FNV or so; I'd expect it to break even
somewhere in the 10-20 comparison range.
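
A rough way to probe that break-even point would be something along
these lines (just a sketch; the names and the 20-entry size are
invented):

  ;; Build a 20-key alist and an equivalent hash table, then compare
  ;; lookups of the last (worst-case) key in a tight loop.
  (use-modules (srfi srfi-1))   ; for 'last'
  (define keys
    (map (lambda (i)
           (string->symbol (string-append "prop" (number->string i))))
         (iota 20)))
  (define alist (map (lambda (k) (cons k 0.0)) keys))
  (define table (make-hash-table))
  (for-each (lambda (k) (hashq-set! table k 0.0)) keys)
  ;; Time (assq (last keys) alist) against (hashq-ref table (last keys))
  ;; over many iterations, e.g. with get-internal-real-time.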



> * Scheme has no marshaling: objects are either direct (scheme -> C++
> is bit twiddling), or they are indirect (a tagged pair with a C++
> pointer in the cdr)
>

Half of that I expected (more specifically, for various reasons, a
number of which were not accurate, I expected the Scheme APIs to be
similar to the TCL APIs, and there as well you just get handed
straight what TCL has in hand, no marshalling involved). One thing I
didn't know is that the client calls to extract the machine
representation of the value would be super cheap.
But still, if the Guile compiler translates Scheme values into native
ones and is able to leave them there for "long" stretches of code in
some cases, and our use case instead prevents that, it seems it could
eventually add up. Again, I do need to learn the source better before
these thoughts deserve any real weight.



> IMO the real problem is that we don't have good tooling to see what is
> going on in the Scheme side of things: C++ has decent perf analysis,
> but the Guile side of things just looks like a lot of time spent in
> scm_eval(). Some of it is overhead, some of it might be a poorly
> implemented Scheme function (which is often easier to fix than
> reducing overhead.)
>

I very much agree that poorly conceived code is the first thing to
address, no doubt.
I'd think that the way to gain insight into what's going on is to
inspect the bytecode, actually, and gain familiarity with the APIs
that execute it.
Is it that the bytecode is then translated to executable code, or is
it running on a VM? I would assume they don't provide a decompiler of
any sort, do they?
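
(I believe recent Guile versions ship a ,disassemble REPL meta-command;
if so, a first look could be as simple as this sketch:)

  scheme@(guile-user)> (define (add1 x) (+ x 1))
  scheme@(guile-user)> ,disassemble add1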

Thanks for a most interesting discussion

L


Re: Blockers for Guile 2.2

2022-02-27 Thread Han-Wen Nienhuys
On Sat, Feb 26, 2022 at 2:02 PM Jonas Hahnfeld  wrote:

> > > The same happens for C++ files, you also have to recompile. But it's
> > > true that editing scm files isn't for free anymore.
> >
> > The Scheme compilation felt much slower, and for C++ ccache takes away
> > a lot of the pain of recompiles. It also appears to be
> > single-threaded? I admit not having timed it in detail.
>
> Yes, GUILE_AUTO_COMPILE=1 is single-threaded which is why it will never
> be the default.

If we precompile as part of the build, I wonder if we could use a hack
like -djob-count to split lilypond into N jobs where each one does a
part of the job.

https://www.gnu.org/software/guile/manual/html_node/Compilation.html

suggests that one can call the compiler explicitly.

-- 
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen



Re: Blockers for Guile 2.2

2022-02-27 Thread Han-Wen Nienhuys
On Sun, Feb 27, 2022 at 10:39 AM Luca Fascione  wrote:
> In other words, is it approximately true that "for (almost) any real-life 
> score the total compilation time
> is proportional to the number of NoteHeads, past a certain size"?
> I'm guessing you need a few pages worth of material to kill away constant 
> overheads, but beyond that
> is it true that if you double the source size you double the compilation time?

it should be, but we have rather complicated page breaking code that
is so hairy that nobody understands it fully. I'm not sure there is an
NP-complete snake in the proverbial grass there.

>> I am not sure what you mean by effort spent
>> in "preparing" callbacks, could you elaborate?
>
>
> Imagine the grob is a C++ object that remains opaque to Scheme. So it's a 
> thing for which
> in Scheme you move around a blind pointer, but to read property Y-offset 
> you'd call (grob-get 'Y-offset)
> and grob-get is C++.
> ..

accessing data (e.g. offsets):
* on first access, the callback gets executed. This is just evaluating
(<callback> <grob>).
* on second access, the value is cached in an alist, and looking it up
is extremely cheap.
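
In toy form (all names invented), that read-through caching pattern
looks like this; a "grob" here is just a pair of a callback alist and
a value-cache alist:

  (define (make-toy-grob callbacks) (cons callbacks '()))
  (define (toy-grob-get grob prop)
    (let ((hit (assq prop (cdr grob))))
      (if hit
          (cdr hit) ; second access: cheap alist hit
          (let ((val ((assq-ref (car grob) prop) grob)))
            (set-cdr! grob (acons prop val (cdr grob)))
            val)))) ; first access: run the callback, cache the result

  (define g (make-toy-grob (list (cons 'Y-offset (lambda (g) 1.5)))))
  (toy-grob-get g 'Y-offset) ; computes and caches => 1.5
  (toy-grob-get g 'Y-offset) ; cached => 1.5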

> Code following a simple pattern like this, once compiled, will largely be 
> dominated by the
> scripting language runtime overhead

From the outside this may seem plausible, but I doubt your intuition here:

* the callback tables are also alists. They are cheap (they aren't
sped up if you swap them for hash tables)
* Scheme has no marshaling: objects are either direct (scheme -> C++
is bit twiddling), or they are indirect (a tagged pair with a C++
pointer in the cdr)

IMO the real problem is that we don't have good tooling to see what is
going on in the Scheme side of things: C++ has decent perf analysis,
but the Guile side of things just looks like a lot of time spent in
scm_eval(). Some of it is overhead, some of it might be a poorly
implemented Scheme function (which is often easier to fix than
reducing overhead.)

> (*) 'boxing' is a C# word to mean wrap a POD value in an object to move it 
> around in a way
> that is uniform with the other objects in the language. C# people need to 
> keep boxing and
> unboxing costs in their head in code with lots of PODs being moved around, 
> because that
> cost can dominate the execution of their code. I'm not sure what word is used 
> in Guile for this
> concept.

For integers, floats and booleans, converting to and from the CPU
native representation involves some bit operations. Intermediate
conversions could be short-cut if you have full type information and
have a sequence of those in a row.

-- 
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen



Re: Blockers for Guile 2.2

2022-02-27 Thread Han-Wen Nienhuys
On Sun, Feb 27, 2022 at 12:31 AM Jean Abou Samra  wrote:
> > I never said I don't want to fix Guile 2.2 bugs, and you should know
> > as I spent lots and lots of time debugging #6218.  I also said I
> > support moving CI to 2.2, so any MR would pass against 2.2.
> >
> > I am just asking to not drop 1.8 support.
> >
> > Most of the work we do isn't actually about Guile anyway,
> >
> > $ git log --since 2022/01/01 | grep ^commit|wc
> >  257 514   12336
> > $ git log  --grep=[Gg]uile --since 2022/01/01 | grep ^commit|wc
> >7  14 336
>
> I agree that the version-specific code we have (cond-expand
> and GUILEV2) isn't all that much. On the other hand, I would
> be glad to be able to use Guile 2 features, such as Unicode
> support, when and unless, (srfi srfi-71) (let and let*
> accepting multiple values), S-expression comments,
> scm_c_bind_keyword_arguments, and a few others. Now
> that my principal concern with the sustainability of
> Guile 2 binaries (shipping bytecode with them) is cleared,
> I have mixed feelings about when to leave Guile 1 behind.

You say you have mixed feelings, but I think (with your updates to
compilation), those feelings are ever less mixed?

> In my previous post I showed that compiling all Scheme
> files can be done in 20s with Guile 2 (so a few seconds
> for a single file), and 4s with Guile 3 (so near instant
> for one file). Would that address your concern with
> compilation speed?

Thanks a ton for your investigation into this. This is a game changer:

MSDM-reduced

1.8: real 0m14.788s
2.2: real 0m14.648s

les-nereides:

1.8: real 0m1.376s
2.2: real 0m1.224s

Let's kill 1.8 support.

-- 
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen



Re: Blockers for Guile 2.2

2022-02-27 Thread Luca Fascione
On Sat, Feb 26, 2022 at 10:48 PM Jean Abou Samra  wrote:

>
> [Jonas]
> > He, I always thought auto-compilation didn't optimize! Now don't
> > tell me Guile also applies optimizations while just reading and
> > supposedly interpreting code...
>
> I don't think it does. At least, you don't usually call eval or
> primitive-eval on code to run it repeatedly, just for one-shot
> code, so I don't see the sense it would make. Also, it seems
> to me that the Guile developers are much more focused on compilation.
>

Randomly, it seems to me my own problem with fingering might possibly
bring up a counter-example pattern (for eval vs compile): when you
attach a callback to something that is evaluated often (in the case of
fingering there is an after-line-break callback, which I am guessing
runs O(count(Fingering)) ~= O(count(NoteHead)) times). Isn't this a
case of Scheme code coming in from the .ly file (likely an included
.ly file) and being run O(N) times?
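
(Concretely, I'm imagining something like this sketch, assuming the
usual after-line-breaking grob property; the body is just
illustrative:)

  \override Fingering.after-line-breaking =
    #(lambda (grob)
       ;; runs once per Fingering grob, i.e. O(count(Fingering)) times
       (ly:message "after-line-breaking called"))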

I can't say I understand lilypond super well, but it doesn't feel like
it would often employ algorithms that are superlinear in the NoteHead
count (it seems most activities would be organized as linear scans),
so I'm guessing O(N) is as bad as it gets, for a cursory evaluation?

In other words, is it approximately true that "for (almost) any real-life
score the total compilation time
is proportional to the number of NoteHeads, past a certain size"?
I'm guessing you need a few pages worth of material to kill away constant
overheads, but beyond that
is it true that if you double the source size you double the compilation
time?

The background for this was what David was saying, whether code in .ly
files would be optimized or not.
At a guess, I'd think stuff that is big and runs O(n) times might be
worth checking for optimization opportunities.
I am sensitive to your other point about the nightmare it will be to
keep this stuff around and properly invalidate it upon changes.

> > [ skipping over the part regarding Guile 3, since I don't think it's
> > relevant here ]
>
> Perhaps I should have changed the title, but I do think it
> is relevant -- it gives hope for a state where development
> cycles will be easier. When we want Guile 3 is another question.
> I'm in favor of not making the move to Guile 2 wait more
> than it already has, but I wonder if at some point in this release
> cycle (by which I mean before the next stable release) we will want
> to switch to Guile 3.
>

fwiw, this sounds super reasonable to me.

> I was just trying to get orders
> of magnitude (for Guile 3 it's 1m30 vs. 4s, no need for
> precise benchmarks to see which is faster :-).
>

Indeed. I was more thinking that not only are the opt/no-opt numbers
close, but the 18-ish and 19-ish second timings are also close; it is
possible that that difference too is spurious for some reason. (I
guess I'm saying: there's a possibility you have been a little lucky
or a little unlucky, and your actual runtime difference is closer to
2% or maybe 15%, but you happened to take samples at "weird" times.)

I might have some time next week (after 7 March) to run these tests
several times; it depends on some stuff I have going on next weekend.
I'll contact you oob if I do, for some quick guidance exchange.

> I am no performance expert, but LilyPond's performance-critical parts
> are written in C++, so in general I would think the cost of Scheme code
> is spread a bit all over the code base (unlike C++ where you can find
> some rather critical parts taking lots of time, like page breaking,
> or skyline algorithms).


Well, but this would be a single call from Scheme into C++ that takes
"a long time", do I read you right?

Instead I was thinking more of the "dying from a thousand cuts" kinda
scenario.
[continue below]


> I am not sure what you mean by effort spent
> in "preparing" callbacks, could you elaborate?
>

Imagine the grob is a C++ object that remains opaque to Scheme. So it's a
thing for which
in Scheme you move around a blind pointer, but to read property Y-offset
you'd call (grob-get 'Y-offset)
and grob-get is C++.

And then in C++ you'd just have a one-liner accessor that is conceptually

  float scheme_grob_get (Grob *grob, sch_value *propname) {
    // typecheck propname to be a symbol
    // find out which accessor you want to delegate to for that symbol
    return grob->get_y_offset ();  // actual delegation
  }

and in turn the getter is something like

  inline float get_y_offset () const {
    return y_offset;  // this is a float member, say
  }

Code following a simple pattern like this, once compiled, will largely
be dominated by the scripting language runtime overhead: traversing
the callback table, finding the function pointer for the dispatch,
marshalling the values from the Scheme representation to something you
can mess with in C++, and the way back upon return.
I've read enough of your C++ to see that a lot of this happens in
client code, but either way it's all code whose cost dominates the
execution of the accessor