> Date: Sat, 11 Feb 2023 21:40:58 +0100
> From: pertu...@free.fr
>
> For French, and I believe all the languages with accented letters that
> should sort next to the non-accented letter, for instance e and é, the
> sort is much better with Unicode::Collate.
But if the index entries actually don't have any accented letters (as is
the case with the Emacs Lisp Reference manual), why would that slow down
sorting so much?  Collation weights of basic characters should be
trivial to compute, even if Unicode::Collate implements the full Unicode
Collation Algorithm.

> > > How come format_printindex takes such a large proportion of the
> > > processing?  Isn't that strange?  Index entries are usually a small
> > > proportion of the overall manual's text, so processing the manual
> > > should take the lion's share.  The index in the manual you were
> > > timing has about 8K entries, but the entire manual is 100K lines,
> > > so the index is less than 10% of the total volume.  How come its
> > > processing is so expensive?
> >
> > It's the sorting of the index entries into alphabetical order, I
> > presume.  There isn't a similar sorting process for the rest of the
> > manual.
>
> Exactly.  Given the size of the index, it may be the most extreme
> slowdown, if it is more than linear in the size of the index.  As to
> why Unicode::Collate is slow, I do not think it is easy to know.  It
> could depend on the version of Unicode::Collate, too.

Even for 8K index entries, it is still a very long time.  For reference,
sorting the Index of the Emacs Lisp Reference manual with the Emacs
command sort-lines takes just 0.25 sec.  (I deliberately forced
sort-lines to sort in reverse order, to avoid a no-op sort in ascending
order, since the Index is already sorted.)

So I still don't understand the numbers presented by Gavin, and I think
more investigation is in order.  Maybe it's worthwhile to emulate
collation sorting of 8K strings with a C program; if the sort time is
significantly shorter that way, perhaps the index-sorting code could
benefit from a Perl extension?
Btw, I'm not sure I understand the time data presented by Gavin:

> Top 15 Subroutines
>     Calls  P  F    Exc    Inc  Subroutine
>   2280071  1  1  23.1s  25.5s  Unicode::Collate::getWt
>    122770  1  1  14.4s  15.6s  Unicode::Collate::splitEnt
>    351998 22  1  7.86s  67.2s  Texinfo::Convert::Plaintext::_convert
>    122770  1  1  6.86s  48.8s  Unicode::Collate::getSortKey
>    270366 28  1  1.52s  1.59s  Texinfo::Convert::Plaintext::_count_added
>   2280071  1  1  973ms  973ms  Unicode::Collate::varCE (xsub)
>    167542  1  1  899ms  1.26s  Texinfo::Convert::Plaintext::_process_text
>    184832  8  2  842ms  842ms  Texinfo::Convert::Paragraph::add_text (xsub)
>   2280071  1  1  724ms  724ms  Unicode::Collate::_fetch_simple (xsub)
>   2280071  1  1  550ms  550ms  Unicode::Collate::_ignorable_simple (xsub)
>   4564446  8  1  530ms  530ms  Unicode::Collate::CORE:match (opcode)
>   2280071  1  1  508ms  508ms  Unicode::Collate::_exists_simple (xsub)
>     62010  1  1  463ms  49.7s  Texinfo::Structuring::_collator_sort_string
>    122770  1  1  444ms  622ms  Unicode::Collate::process
>         1  1  1  434ms  434ms  Texinfo::Parser::parse_file (xsub)

This seems to say that Unicode::Collate::getWt alone took 23.1 sec, that
Unicode::Collate::getSortKey with all its callees took 48.8 sec, and
that the entire conversion took 67.2 sec?  On my system, a 12-year-old
machine running Windows XP, producing the Emacs Lisp Reference manual
for Emacs 27.2 takes just 18.1 sec of CPU time, so how come Gavin
reports such huge timings?  I measured with Texinfo 7.0.1 -- are you
saying that the current version from the Texinfo Git master branch is so
much slower?  Or did we not use Unicode::Collate in Texinfo 7.0.x?