[Full quote, because I mistakenly moved the discussion to the lualatex list. Moving now to [email protected], please reply there. Sorry for the multiple mails.]
On 03.10.2012 20:06, Stephan Hennig wrote:
> [CC'ing [email protected] and [email protected],
> since spell checking is of international concern.
> Please reply to [email protected].]
>
> On 02.10.2012 16:01, Pander wrote:
>
>> You can mention that the Dutch patterns are being processed by OpenTaal.
>> They are put on hold since we are working very hard on the next version
>> of spell checking at the moment.
>
> You're speaking about spell checking, not hyphenation, right? Could you
> please elaborate a bit?
>
> I've recently thought about spell checking of TeX documents and came up
> with the following idea that requires LuaTeX's node list manipulations:
>
> 1. In the first LuaTeX run, write all typeset text into a UTF-8 encoded
>    text file.
>
> 2. Feed that text file to your favourite spell checker, generating a
>    list of bad words.
>
> 3. In the second run, LuaTeX reads in the list of bad words and puts a
>    red wavy line under all bad words in the document. A possible approach
>    is to mark nodes corresponding to a bad word in pre_linebreak_filter
>    with an attribute so that they can be identified later.
>
>
> Pro:
>
> + The approach is agnostic of the spell checking application. It only
>   requires that the spell checker can output a list of bad words
>   (aspell and hunspell can do so).
>
> + The spell checker doesn't need to know TeX syntax. Even though
>   aspell as well as hunspell can cope with TeX source files, they
>   cannot spell check TeX generated text that is not explicitly in
>   the source file. Additionally, commercial spell checkers likely
>   do not know about TeX (such as Duden Korrektor, a spell checker
>   for the German language).
>
> + You can optionally use multiple spell checkers at once.
>
> + Point'n'click people have their red wavy lines in the PDF, while
>   others can still just look at the list of bad words.
>
> + The approach might work with grammar checkers as well. Don't know.
>
>
> Cons:
>
> - Red wavy lines are only marketing ...
>
>
> I have attached a small package totext (license is LPPL) trying to
> implement step 1 outlined above. To test it, add the line
>
> \usepackage{totext}
>
> to a LaTeX file and process that with LuaLaTeX. The package should work
> with other formats as well, but then users need to adapt file
> totext.sty, which consists of only 2 lines. During the TeX run, a file
> <jobname>.txt is created that should contain most of the text of the TeX
> output. The output is broken to a fixed line length that is currently
> hard-coded to 72 characters per line (can be adjusted on line 164 in
> totext.lua). Attached is file sample2e.txt, which contains the output
> of a compile run of sample2e.tex.
>
> The package currently hooks into the pre_linebreak_filter and
> hpack_filter callbacks. I'm not sure what the best callbacks are, but
> to avoid irritating the spell checker, words should preferably not be
> hyphenated in the text file. The red wavy lines, on the other hand,
> need to be inserted after all text is laid out on the page (perhaps in
> buildpage_filter?).

The code is now available on GitHub,
<URL:https://github.com/hennigs/spelling>.

> What doesn't work:
>
> * The package currently doesn't deal with mathematics.

See issue #8.

> * Ligatures are not resolved into their constituent letters.

I've added a code point substitution feature. The most important Latin
ligatures, like 'ﬀ', 'ﬁ' etc., are now translated into 'ff', 'fi' etc.
to help spell checkers that cannot handle the ligature code points. The
translation table is currently hard-coded. A TeX interface for
fine-grained substitution control would be nice, e.g., for switching off
substitution of long s (ſ) by s.

Contributions are warmly welcome, especially those for the TeX parts.
I'd love to see issues #1 and #2 resolved soon to make a first upload to
CTAN.

> * Footnote marks are missing in the text.

That works now.

> * It fails miserably on the \LaTeX logo. The package adopted the
>   definition of a word from the chickenize package (start with a
>   glyph node, end with a node whose id is neither of 37 glyph,
>   7 disc, 11 kern, 22 ???). It seems like more nodes have to be
>   considered as being possible parts of a word.

The definition of the LaTeX logo contains a \vbox. That is best repaired
by providing a definition of the logo without a \vbox within a word (the
TeX logo does without), see issue #12.

I'm on the road for the rest of the week and perhaps a bit less
responsive.

Oh, and did I mention that I'd be happy to hand over maintenance of this
package to someone else? Check it out!

Happy TeXing,
Stephan Hennig
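
PS: To illustrate the code point substitution and the fixed line length mentioned above outside of TeX, here is a minimal sketch in Python. It is not the package's Lua code. The table contents and the names SUBSTITUTIONS, substitute_code_points and to_spellcheck_text are assumptions made up for this example; the package's hard-coded table may well differ.

```python
import textwrap

# Hypothetical substitution table (an assumption for illustration only).
# Keys are single code points, values are the replacement strings.
SUBSTITUTIONS = {
    "\uFB00": "ff",   # LATIN SMALL LIGATURE FF
    "\uFB01": "fi",   # LATIN SMALL LIGATURE FI
    "\uFB02": "fl",   # LATIN SMALL LIGATURE FL
    "\uFB03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\uFB04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\u017F": "s",    # LATIN SMALL LETTER LONG S
}


def substitute_code_points(text, table=SUBSTITUTIONS):
    """Replace every character found in `table` by its replacement string."""
    return "".join(table.get(ch, ch) for ch in text)


def to_spellcheck_text(text, width=72, table=SUBSTITUTIONS):
    """Resolve ligatures, then wrap the text at `width` characters.

    Words are never split or hyphenated, so a single word longer than
    `width` ends up on an over-long line of its own.
    """
    return textwrap.fill(
        substitute_code_points(text, table),
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )


if __name__ == "__main__":
    print(to_spellcheck_text("An e\uFB03cient \uFB01sh in the Wach\u017Ftube."))
```

Switching off the substitution of long s then amounts to passing a table without the U+017F entry, which is roughly what a TeX interface for fine-grained control would have to expose.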
