Hi Brian,

Thanks for your response.

> I'd suggest you first make sure your XML is really UTF-8

I believe it is. I used a hex editor to look at the XML source file
and the character in question (the "Registered Sign") is encoded as
hex "c2 ae" which is the proper UTF-8 encoding for that character [1].
There were other XML files processed with the same script that had
non-ASCII characters (in the 520 field where we are sticking the
theses abstracts) and also verified as being UTF-8 encoded, and they
did not seem to cause any errors. The 520 field isn't processed any
differently in my script (I'm double-checking, natch) so that's partly
why I am confused.

> ...using JHOVE

I was not familiar with JHOVE, but looked it up and it sounds like a
very useful tool [2]. I have downloaded it, and will be trying it out.

-- Michael

[1] FileFormat.Info > Unicode Character 'REGISTERED SIGN' (U+00AE)
    http://www.fileformat.info/info/unicode/char/00ae/index.htm

[2] JHOVE - JSTOR/Harvard Object Validation Environment
    http://hul.harvard.edu/jhove/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Brian Sheppard [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 21, 2008 1:00 PM
> To: Doran, Michael D
> Cc: perl4lib@perl.org
> Subject: Re: Help for utf-8 output
>
> I'd suggest you first make sure your XML is really UTF-8, using JHOVE:
>
>   /path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m utf8-hul myFile.xml
>
> If it fails you could convert to utf8, on the (perhaps unwarranted)
> assumption it's windows latin1:
>
>   iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml
>
> Then, of course, test myFile.utf8.xml with jhove to see if it's valid.
>
> -Brian
>
>
> On February 21, at 11:48 AM, Doran, Michael D wrote:
>
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting
> > theses/dissertations records (in XML) to MARC records. I'm still in
> > the testing stage, but have had similar problems with records with
> > diacritics in the 100 or 245 fields (however diacritics in a 520a
> > field don't seem to cause any problems). Since our records are not
> > "diacritic rich" it's hard to determine the exact extent of the
> > problem.
> >
> > I am using these versions:
> >   Perl v5.8.8
> >   MARC::Charset 0.98
> >   MARC::Lint 1.43
> >   MARC::Record 2.0
> >   XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the
> > 245 field):
> >
> >   marcdump test.mrc
> >   test.mrc
> >   LDR 00127cam a2200037 4500
> >   245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
> >          _cRiho Yoshioka.
> >
> >    Recs  Errs Filename
> >   ----- ----- --------
> >       1     1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> >   Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> >   Invalid length in directory for tag 245 in record 1
> >   field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader
> > (position 09=a) indicates that it is Unicode encoded. I've attached
> > the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> >   Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> >   field does not end in end of field character in tag 100 in record 3
> >   field does not end in end of field character in tag 245 in record 3
> >   Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >
> >   field does not end in end of field character in tag 260 in record 3
> >   Invalid indicators ". " forced to blanks in record 3 for tag 260
> >
> >   field does not end in end of field character in tag 300 in record 3
> >   Invalid indicators ". " forced to blanks in record 3 for tag 300
> >
> >   field does not end in end of field character in tag 502 in record 3
> >   Invalid indicators ". " forced to blanks in record 3 for tag 502
> >
> >   field does not end in end of field character in tag 504 in record 3
> >   Invalid indicators ". " forced to blanks in record 3 for tag 504
> >
> >   field does not end in end of field character in tag 690 in record 3
> >   Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> >
> > Anybody have any ideas?
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >
> >
> >> -----Original Message-----
> >> From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, February 19, 2008 10:50 AM
> >> To: perl4lib@perl.org
> >> Subject: Help for utf-8 output
> >>
> >> I was wondering if anyone has similar experience and has come up
> >> with good solutions to help solving the challenge below?!
> >>
> >> What I have is an Excel spreadsheet for dissertations which I have
> >> saved as a tab delimited file (examining the file in TextPad, the
> >> diacritics appears to be fine), then read in and output the file as
> >> a utf-8 MARC file. I <print> title field confirming author field
> >> that contains diacritics with the title showing proper indicator
> >> values.
> >>
> >> But when I looked the MARC itself, the fields that follow the field
> >> containing diacritics are all off its original position. See
> >> attached zip file. Examples below: first two have diacritics in a
> >> 100 field, last one diacritic is in 245 subfield b)
> >>
> >>   001 diss 34001
> >>   100 1 _aP<E9>rez, Nancy L.
> >>   245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> >>
> >>   001 diss 34042
> >>   100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
> >>   245 _aDoing being boricua :
> >>
> >>   001 diss 33892
> >>   100 1 _aDavis, Jennifer M.
> >>   245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> >>          _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> >>
> >> I would be greatly appreciate any suggestion to solve this.
> >> Thank you most kindly.
> >>
> >> Regards,
> >>
> >> --Jackie
> >>
> >> |Jackie Shieh
> >> |Data Loads & Development
> >> |Harlan Hatcher Graduate Library
> >> |University of Michigan
> >> |920 North University
> >> |Ann Arbor, MI 48109-1205
> >> |Phone: 734.763.6070 FAX: 734.615.9788
> >> |E-mail: JShieh [AT] umich [DOT] edu
> >>
> >> <test.mrc>
>
> --------------------------------------------------
> Brian Sheppard
> University of Wisconsin Digital Collections Center
> [EMAIL PROTECTED] (608) 262-3349
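
For anyone following the thread, two of the checks discussed above can be sketched in a few lines. This is only an illustration, written in Python as an assumption (the scripts in this thread are Perl), and it does not use MARC::Record or JHOVE. It shows that the Registered Sign is two bytes (c2 ae) in UTF-8, so the byte length of the 245 $a text is one more than its character count, which matters because MARC::Lint reports lengths in bytes. It also shows a strict-decode test as a rough stand-in for a UTF-8 validity check. The helper name is_valid_utf8 is hypothetical.

```python
# Illustrative sketch only: not the script discussed in this thread.
# The title text is copied from the marcdump output quoted above.

REGISTERED_SIGN = "\u00ae"
title = "An Empirical Test Of The Situational Leadership\u00ae Model In Japan /"


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode strictly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings.
utf8_bytes = title.encode("utf-8")
latin1_bytes = title.encode("latin-1")

# UTF-8 encodes U+00AE as the two bytes c2 ae.
print(REGISTERED_SIGN.encode("utf-8").hex())  # c2ae

# Character count versus byte count: they differ by one byte here,
# because the title contains exactly one two-byte character.
print(len(title), len(utf8_bytes))

# Latin-1 stores the sign as the single byte ae, which is not valid
# UTF-8 on its own, so the strict decode rejects it.
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

If a script computes field or record lengths from character counts while writing UTF-8 bytes (or the reverse), the leader and directory values will disagree with the actual byte count in the way the Lint messages describe. Whether that is what happens in either script here is not established in this thread.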