> The names of which character-sets I might be working 
> with here

Depending on how you are getting the data out of your III system, and whether 
or not you've upgraded the system to Unicode, the catalog data is likely in the 
MARC-8 character set.

If you're looking to convert that data to UTF-8 (which I assume you are), 
then your best friend is a program from Index Data called yaz-marcdump, which 
comes with the YAZ toolkit.  It runs on Linux and Windows, and can be invoked 
from the command line or from scripts to quickly and painlessly convert your 
catalog data into UTF-8.

  http://www.indexdata.com/yaz
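
As a rough sketch of the "invoked from scripts" case, the Python below wraps 
yaz-marcdump in a small helper. The function names and the file names are 
made up for illustration; the flags (-f source character set, -t target 
character set, -o output format, -l 9=97 to mark the leader as Unicode) are 
the ones I'd expect from the yaz-marcdump man page, so check them against 
your installed version before relying on this.

```python
import shutil
import subprocess


def build_yaz_command(src_path, from_charset="MARC-8", to_charset="UTF-8"):
    """Return the yaz-marcdump argument list for a character-set conversion."""
    return [
        "yaz-marcdump",
        "-f", from_charset,   # character set the records are in now
        "-t", to_charset,     # character set to convert to
        "-o", "marc",         # write binary MARC (ISO 2709) rather than line format
        "-l", "9=97",         # set leader byte 9 to 'a' (ASCII 97) = Unicode
        src_path,
    ]


def convert_marc(src_path, dest_path):
    """Run yaz-marcdump and write the converted records to dest_path."""
    if shutil.which("yaz-marcdump") is None:
        raise RuntimeError("yaz-marcdump not found on PATH; install the YAZ toolkit")
    with open(dest_path, "wb") as out:
        # yaz-marcdump writes the converted records to standard output
        subprocess.run(build_yaz_command(src_path), stdout=out, check=True)
```

With an export saved as, say, catalog.mrc (a hypothetical file name), 
convert_marc("catalog.mrc", "catalog-utf8.mrc") would leave the original 
untouched and write the UTF-8 copy alongside it.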

--Dave

==================
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu
________________________________________
From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Ken Irwin 
[kir...@wittenberg.edu]
Sent: Wednesday, December 16, 2009 9:02 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] character-sets for dummies?

Hi all,

I'm looking for a good source to help me understand character sets and how to 
use them. I pretty much know nothing about this - the whole world of Unicode, 
ASCII, octal, UTF-8, etc. is baffling to me.

My immediate issue is that I think I need to integrate data from a variety of 
character sets into one MySQL table - I expect I need some way to convert from 
one to another, but I don't really even know how to tell which data are in 
which format.

Our homegrown journal list (akin to SerialsSolutions) includes data ingested 
from publishers, vendors, the library catalog (III), etc. When I look at the 
data in emacs, some of it renders like this:
 Revista de Oncolog\303\255a                  [slashes-and-digits instead of 
diacritics]
And other data looks more like:
 Revista de Música Latinoamericana    [weird characters instead of diacritics]
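
One possible reading of the two samples above, sketched in Python: both may 
simply be UTF-8 bytes that the editor is displaying in two different ways. 
This only demonstrates the byte arithmetic; it is an assumption, not a 
diagnosis of the actual files. The helper name as_latin1 is invented for the 
example.

```python
# \303\255 is octal notation for the two bytes 0xC3 0xAD.
shown_as_escapes = bytes([0o303, 0o255])

# Decoded as UTF-8, those two bytes are the single character "í".
print(shown_as_escapes.decode("utf-8"))


def as_latin1(text):
    """Show how UTF-8 bytes look when each byte is read as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


# "ú" is 0xC3 0xBA in UTF-8; read byte by byte as Latin-1 that becomes "Ãº".
print(as_latin1("Música"))
```

If that reading holds, the "slashes-and-digits" form and the "weird 
characters" form are the same kind of data seen through different display 
settings.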

My MySQL table is currently set up with the collation set to utf8_bin, and 
the titles from the second category (weird characters display in emacs) render 
properly when the database data is output to a web browser. The data from 
the former example (\###) renders as an "I don't know what character this is" 
placeholder in Firefox and IE.

So, can someone please point me toward any or all of the following?

· A good primer for understanding all of this stuff

· A method for converting all of my data to the same character set so 
it plays nicely in the database

· The names of which character-sets I might be working with here

Many thanks!

Ken
