On 04/07/2012 04:02 PM, Paul Johnson wrote:
Greetings, LyX Land:
I've encouraged people to learn to use LyX, so when they run into trouble,
I feel responsible to try and help. I use Linux to prepare documents, so
I have not experienced this problem before. Many people still use Windows
and MS Word and such, and so they do things that I would not expect, and
I am frustrated when these things arise.
I think the question I need to ask you is this: How can I find out what encoding
is currently used in the LyX document and what should it be to make it
work properly?
And how can I wrestle all of the characters into the correct encoding? Is there
no magic wand to scan a LyX text file and change everything to a desired
encoding?
Here's the long version:
A student has LyX documents that have lots and lots of invalid characters. I'm virtually certain
most of these were inserted into LyX by a Copy & Paste from MS Word and/or Adobe Acrobat.
In all of the places where Word used an apostrophe, we seem to have an illegal character. I
think the same goes for quotation marks, and probably other characters. I'm pretty sure the quotation mark
and apostrophe problems result from Word's use of "smart quotes" by default.
I wondered if we shouldn't open the LyX document in Emacs and then search
and replace the bad characters. If I knew how to insert characters that LyX
would accept, I would do that.
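The search-and-replace idea could also be scripted instead of done by hand in Emacs. Here is a minimal Python sketch, assuming the file's text has already been decoded correctly; the mapping and the function name plain_punctuation are made up for illustration and are not part of LyX:

```python
# Hypothetical mapping from Word-style "smart" punctuation to ASCII.
# Extend it with whatever other characters turn up in the document.
SMART_TO_PLAIN = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (Word's apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "--",   # en dash
    "\u2014": "---",  # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def plain_punctuation(text: str) -> str:
    """Replace smart punctuation in text with the ASCII equivalents above."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))
```

Work on a copy of the file, and be careful with a .bib file: a plain " may act as a field delimiter there, so the double-quote entries may need to be dropped from the mapping.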
LyX stores its files in UTF-8 (UCS-4 is what it uses internally). You can use:
file -i myfile.lyx
to determine what the encoding actually is, though this does not always
get it right. If there aren't any extended characters, it will tell you
it's ASCII, etc.
You can use the iconv program to convert the encoding automatically.
Otherwise, it's a guessing game until you get them all fixed.
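The guessing can be made a bit more systematic. Below is a rough Python sketch of what an iconv run with a guessed "from" encoding does: it tries a short list of candidate encodings in order and rewrites the file as UTF-8. The candidate list, the choice of UTF-8 as the target, and the function names are all assumptions for illustration; cp1252 is only a plausible guess for text that came from a Western Windows machine.

```python
from pathlib import Path

# Candidate source encodings, tried in order. This list is an assumption;
# put the most likely guess for the machine in question after utf-8.
CANDIDATES = ("utf-8", "cp1252")


def decode_with_fallback(raw: bytes, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


def convert_to_utf8(src: str, dst: str) -> str:
    """Rewrite src as UTF-8 in dst; return the encoding that was guessed."""
    text, enc = decode_with_fallback(Path(src).read_bytes())
    Path(dst).write_bytes(text.encode("utf-8"))
    return enc
```

A clean decode does not prove the guess was right, only that the bytes are legal in that encoding, so the result still needs to be checked by eye. Write to a new file rather than over the original.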
I think she has a lot of the same trouble with her Bibliography, which
is a bib file exported from Zotero. I have had the problem in my own work that
Zotero will export unexpected encodings, such as the long dash in place of --
in page numbers. But in the student's document, all of the dates of the
citations show up as ???? when LaTeX processes the document.
We see this fairly frequently, and for this kind of reason. The cause is
possibly an invalid character in one of the author names, which will make
everything fail. Some of the discussion here
http://www.lyx.org/trac/ticket/6223
might be relevant.
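To track down which entry is the culprit, it may help to list every line of the bib file that contains something other than plain ASCII. A small Python sketch follows; the function name non_ascii_lines and the file name in the usage note are hypothetical:

```python
def non_ascii_lines(text: str):
    """Yield (line_number, line) for each line with a non-ASCII character."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.isascii():
            yield number, line
```

It could be run on the file's contents read with open("refs.bib", encoding="utf-8", errors="replace"), so that bytes which are not valid UTF-8 become replacement characters and are flagged too. Not every flagged line is an error, since accented author names are legitimate, but the list narrows down where to look.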
So, how to fix this up?
First, how should she configure "Document Settings / Language"?
She's from South East Asia, but writing in English. So perhaps her PC
has more international language features than I'm used to. For LyX
language encoding, is "default" not good? How about utf8,
or one of the other Unicode options?
If she's writing in English, this shouldn't matter a lot. Note that this
controls the encoding of the output TeX file, not the encoding of the
LyX file.
Incidentally, LyX has the Font button to select XeTeX-supported fonts.
Why doesn't that fix the encoding problem? Is a font selection not the same as
an encoding?
XeTeX will use Unicode for export, so that sets the encoding, but it
doesn't necessarily solve the original problem.
Second, we need to force the document to use only the desired encoding.
It is a bit outside my comprehension that a document would allow one to
paste in an invalid character, but that's just me.
I agree that there is an issue here, and that we should try to detect
the encoding of the pasted stuff, etc. I'm unfortunately pretty clueless
about these encoding issues myself, though, so I don't know exactly
where the problems are.
But isn't there a way to convert the characters in one command?
In Linux, I'd try a program like "iconv", if I had a good guess for what
the "from" encoding should be.
We use iconv internally for many conversions.
Richard