Hi Ross,

On 2/21/21 10:42 PM, Ross Moore wrote:
> Hi Ulrike,
>
>> On 22 Feb 2021, at 7:52 am, Ulrike Fischer wrote:
>>
>> On Sun, 21 Feb 2021 20:26:04 +0000, Ross Moore wrote:
>>
>>> Once you have encountered the (correct) comment character,
>>> what follows on the rest of the line is going to be discarded,
>>> so its encoding is surely irrelevant.
>>>
>>> Why should the whole line need to be fully tokenised
>>> before the decision is taken as to what part of it is retained?
>>
>> Well, you need to find the end of the line to know where to stop
>> the discarding, don't you? So you need to inspect the part after
>> the comment char until you find something that says "newline".
>
> My understanding is that this *is* done first, similarly to TeX's
> \read to <csname>, which grabs a line of input from a file before
> doing the tokenisation and storing the result in the <csname>
> (The TeXbook, page 217).
>
> If I'm wrong about this, for high-speed input, then yes, you need
> to know where to stop. But that's just as easy, since you stop when
> a byte is to be tokenised as an end-of-line character, and those
> are known. You need this anyway, even when you have tokenised
> every byte.
>
> So all we are saying is that when handling the bytes between
> a comment and its end-of-line, just be a bit more careful.
>
> It's not necessary for each byte to be tokenised as valid UTF-8.
> Maybe change the (Warning) message, when you know that you are
> within such a comment, to say so. That would be more meaningful to
> a package writer, and to an author who uses the package, looks in
> the .log file, and sees the message.
>
> None of this changes how the file is ultimately processed;
> it's just about being friendlier in the human interface.
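Ross's suggestion can be sketched as a toy model (this is an illustration of the idea, not XeTeX's actual code, and the function name and the fixed '%' comment character are assumptions for the example): locate the comment character at the byte level, decode only the part before it, and discard the rest of the line without validating its encoding. Since UTF-8 continuation bytes are all >= 0x80, an ASCII byte like '%' can never occur inside a multi-byte sequence, so the byte-level search is safe.

```python
# Toy model of "be lenient about bytes inside a comment".
# Assumes the comment character is ASCII '%' with catcode 14.
COMMENT = ord('%')

def split_line(raw: bytes) -> str:
    """Return the decoded, kept part of one raw input line.

    Bytes after the comment character are discarded without being
    decoded, so an invalid UTF-8 sequence inside a comment never
    triggers a decode error or warning.
    """
    idx = raw.find(COMMENT)       # safe: 0x25 can't appear mid-sequence
    kept = raw if idx == -1 else raw[:idx]
    return kept.decode('utf-8')

# The invalid byte 0xFF after '%' is silently dropped:
print(split_line(b'\\section{Intro} % caf\xff comment'))
```

A stricter reader that decoded the whole line first would reject this input; the sketch only ever decodes the portion that will actually be tokenised.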
I think your model of what XeTeX is doing is missing a step. It's
important to distinguish two steps, which are a bit mixed up in some
of the comments here. I'm not 100% sure either, so perhaps more
knowledgeable people can chime in.

- The file is read line by line. This step requires finding the ends
  of lines, hence must depend on some encoding (possibly XeTeX allows
  changing the encoding for lines that are not yet read). It puts
  *characters* (not bytes) in a buffer. This is also the step where
  the \endlinechar is inserted, so any change to \endlinechar on a
  given line can only affect the next line.

- The characters are then turned into tokens, one token at a time.
  Catcodes can be changed within a line, and they affect how
  characters combine into tokens, even within the same line.

The problem here is at the first step, where XeTeX cannot find a
valid line of characters in the given encoding. It might be possible
to use package hooks to change the encoding state for that particular
package, but I haven't followed these new LaTeX developments
carefully.

Best,

Bruno
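The two-phase model above can be sketched in code (again a toy illustration, not XeTeX source; the class and method names are invented for the example). Phase 1 fetches and terminates a whole line before any of its characters are tokenised, which is why an \endlinechar assignment only affects the next line; phase 2 looks catcodes up per character, which is why a catcode change can affect later characters on the same line.

```python
class Reader:
    """Toy two-phase model of TeX-style input processing."""

    def __init__(self, lines):
        self.lines = iter(lines)
        self.endlinechar = '\r'   # TeX's default \endlinechar

    def next_line(self):
        # Phase 1: the whole line is read and terminated BEFORE any
        # of it is tokenised, so changing self.endlinechar while
        # processing this line only affects the next call.
        return next(self.lines) + self.endlinechar

    def tokenize(self, line, catcodes):
        # Phase 2: catcodes are consulted one character at a time,
        # so a mid-line catcode change would affect the rest of the
        # same line. Unknown characters default to catcode 12.
        return [(ch, catcodes.get(ch, 12)) for ch in line]

r = Reader(["first", "second"])
line1 = r.next_line()    # terminated with the old '\r'
r.endlinechar = ''       # "assignment" made while handling line1
line2 = r.next_line()    # the change takes effect only now
print(repr(line1), repr(line2))
```

This separation is also why the encoding problem Bruno describes arises in phase 1: the line buffer must already contain valid characters before tokenisation ever sees them.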