> Date: Sat, 11 Feb 2023 21:40:58 +0100
> From: pertu...@free.fr
>
> For French, and I believe all the languages with accented letters that
> should sort next to the non-accented letter, for instance e and é, the
> sort is much better with Unicode::Collate.
But if the index entries actually don't have any accented letters (as is
the case with the Emacs Lisp Reference manual), why would that slow down
sorting so much?  Collation weights of basic characters should be
trivial to compute, even if Unicode::Collate implements the full Unicode
Collation Algorithm.

> > > How come format_printindex takes such a large proportion of the
> > > processing?  Isn't that strange?  Index entries are usually a small
> > > proportion of the overall manual's text, so processing the manual
> > > should take the lion's share.  The index in the manual you were
> > > timing has about 8K entries, but the entire manual is 100K lines,
> > > so the index is less than 10% of the total volume.  How come its
> > > processing is so expensive?
> >
> > It's the sorting of the index entries into alphabetical order, I
> > presume.  There isn't a similar sorting process for the rest of the
> > manual.
>
> Exactly.  Given the size of the index, it may be the most extreme
> slowdown, if it is more than linear in the size of the index.  As to
> why Unicode::Collate is slow, I do not think it is easy to know.  It
> could depend on the version of Unicode::Collate, too.

Even for 8K index entries, it is still a very long time.  For reference,
sorting the Index of the Emacs Lisp Reference manual with the Emacs
command sort-lines takes just 0.25 sec.  (I deliberately forced
sort-lines to sort in reverse order, to avoid a no-op sort in ascending
order, since the Index is already sorted.)

So I still don't understand the numbers presented by Gavin, and I think
more investigation is in order.  Maybe it's worthwhile to emulate
collation sorting of 8K strings with a C program; if the sort time is
significantly shorter that way, perhaps the index-sorting code could
benefit from a Perl extension?
Btw, I'm not sure I understand the time data presented by Gavin:

> Top 15 Subroutines
>     Calls  P  F    Exc    Inc  Subroutine
>   2280071  1  1  23.1s  25.5s  Unicode::Collate::getWt
>    122770  1  1  14.4s  15.6s  Unicode::Collate::splitEnt
>    351998 22  1  7.86s  67.2s  Texinfo::Convert::Plaintext::_convert
>    122770  1  1  6.86s  48.8s  Unicode::Collate::getSortKey
>    270366 28  1  1.52s  1.59s  Texinfo::Convert::Plaintext::_count_added
>   2280071  1  1  973ms  973ms  Unicode::Collate::varCE (xsub)
>    167542  1  1  899ms  1.26s  Texinfo::Convert::Plaintext::_process_text
>    184832  8  2  842ms  842ms  Texinfo::Convert::Paragraph::add_text (xsub)
>   2280071  1  1  724ms  724ms  Unicode::Collate::_fetch_simple (xsub)
>   2280071  1  1  550ms  550ms  Unicode::Collate::_ignorable_simple (xsub)
>   4564446  8  1  530ms  530ms  Unicode::Collate::CORE:match (opcode)
>   2280071  1  1  508ms  508ms  Unicode::Collate::_exists_simple (xsub)
>     62010  1  1  463ms  49.7s  Texinfo::Structuring::_collator_sort_string
>    122770  1  1  444ms  622ms  Unicode::Collate::process
>         1  1  1  434ms  434ms  Texinfo::Parser::parse_file (xsub)

This seems to say that Unicode::Collate::getWt alone took 23.1 sec, that
Unicode::Collate::getSortKey with all its callees took 48.8 sec, and
that the entire conversion took 67.2 sec?  On my system, a 12-year-old
machine running Windows XP, producing the Emacs Lisp Reference manual
for Emacs 27.2 takes just 18.1 sec of CPU time, so how come Gavin
reports such huge timings?  I measured with Texinfo 7.0.1 -- are you
saying that the current version from the Texinfo Git master branch is so
much slower?  Or did we not use Unicode::Collate in Texinfo 7.0.x?