Hi Brian,

Thanks for your response.

> I'd suggest you first make sure your XML is really UTF-8

I believe it is.  I used a hex editor to look at the XML source file, and the
character in question (the "Registered Sign") is encoded as hex "c2 ae", which
is the proper UTF-8 encoding for that character [1].  Other XML files processed
with the same script also contained non-ASCII characters (in the 520 field,
where we are putting the thesis abstracts); I verified that those were UTF-8
encoded as well, and they did not seem to cause any errors.  The 520 field isn't
processed any differently in my script (I'm double-checking, natch), so that's
partly why I am confused.
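
A minimal sketch of the byte-level check described above, in Python purely
for illustration (the script under discussion is Perl, and nothing here is
taken from it).  It confirms that U+00AE encodes to the bytes C2 AE, and it
shows that the title string has one more byte than it has characters.  As
far as I understand it, MARC lengths are counted in bytes, so a
character-versus-byte mix-up is one thing worth ruling out; that is an
assumption about the cause, not a diagnosis.

```python
# Illustrative only: compare character count and UTF-8 byte count for the
# 245 $a title from the "bad" record quoted later in this thread.
title = "An Empirical Test Of The Situational Leadership\u00ae Model In Japan /"

# Encode to UTF-8, the encoding the XML source is believed to use.
utf8_bytes = title.encode("utf-8")

# U+00AE (REGISTERED SIGN) is one character but two bytes in UTF-8.
registered_hex = "\u00ae".encode("utf-8").hex()

char_count = len(title)       # length in characters
byte_count = len(utf8_bytes)  # length in bytes

# Strict decoding raises UnicodeDecodeError on malformed UTF-8, so a clean
# round trip means these bytes are valid UTF-8.
round_trip_ok = utf8_bytes.decode("utf-8", errors="strict") == title

print(registered_hex)           # c2ae
print(byte_count - char_count)  # 1
print(round_trip_ok)            # True
```

If a length is computed on characters at one stage and on bytes at another,
each non-ASCII character shifts the totals, which would at least be
consistent with the kind of small length mismatch MARC::Lint reports.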
 
> ...using JHOVE

I was not familiar with JHOVE, but looked it up and it sounds like a very 
useful tool [2].  I have downloaded it, and will be trying it out.

-- Michael

[1] FileFormat.Info > Unicode Character 'REGISTERED SIGN' (U+00AE)
    http://www.fileformat.info/info/unicode/char/00ae/index.htm

[2] JHOVE - JSTOR/Harvard Object Validation Environment
    http://hul.harvard.edu/jhove/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Brian Sheppard [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, February 21, 2008 1:00 PM
> To: Doran, Michael D
> Cc: perl4lib@perl.org
> Subject: Re: Help for utf-8 output
> 
> I'd suggest you first make sure your XML is really UTF-8, using JHOVE:
> 
>    /path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m utf8-hul myFile.xml
> 
> If it fails you could convert to utf8, on the (perhaps 
> unwarranted) assumption it's windows latin1:
> 
>     iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml
> 
> Then, of course, test myFile.utf8.xml with jhove to see if it's valid.
> 
> -Brian
> 
> 
> On February 21, at 11:48 AM, Doran, Michael D wrote:
> 
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting theses/dissertations
> > records (in XML) to MARC records.  I'm still in the testing stage, but
> > have had similar problems with records with diacritics in the 100 or 245
> > fields (however diacritics in a 520a field don't seem to cause any
> > problems).  Since our records are not "diacritic rich" it's hard to
> > determine the exact extent of the problem.
> >
> > I am using these versions:
> >   Perl v5.8.8
> >   MARC::Charset 0.98
> >   MARC::Lint 1.43
> >   MARC::Record 2.0
> >   XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the
> > 245 field):
> >
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037   4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
> >        _cRiho Yoshioka.
> >
> >  Recs  Errs Filename
> > ----- ----- --------
> >     1     1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> >  Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> >  Invalid length in directory for tag 245 in record 1
> >  field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader
> > (position 09=a) indicates that it is Unicode encoded.  I've attached
> > the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> >  Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> >  field does not end in end of field character in tag 100 in record 3
> >  field does not end in end of field character in tag 245 in record 3
> >  Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >  field does not end in end of field character in tag 260 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 260
> >  field does not end in end of field character in tag 300 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 300
> >  field does not end in end of field character in tag 502 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 502
> >  field does not end in end of field character in tag 504 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 504
> >  field does not end in end of field character in tag 690 in record 3
> >  Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> >
> > Anybody have any ideas?
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >
> >
> >> -----Original Message-----
> >> From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, February 19, 2008 10:50 AM
> >> To: perl4lib@perl.org
> >> Subject: Help for utf-8 output
> >>
> >> I was wondering whether anyone has had a similar experience and has
> >> come up with a good solution to the challenge below.
> >>
> >> What I have is an Excel spreadsheet of dissertations, which I have
> >> saved as a tab-delimited file (examining the file in TextPad, the
> >> diacritics appear to be fine), then read in and output as a UTF-8
> >> MARC file.  I print the title field to confirm that records whose
> >> author field contains diacritics show the proper indicator values
> >> for the title.
> >>
> >> But when I looked at the MARC itself, the fields that follow the field
> >> containing diacritics are all shifted from their original positions.
> >> See the attached zip file.  Examples below (the first two have
> >> diacritics in a 100 field; in the last one the diacritic is in 245
> >> subfield b):
> >>
> >> 001     diss 34001
> >> 100 1  _aP<E9>rez, Nancy L.
> >> 245     _aSynchronic and Diachronic Matlatzinkan Phonology.
> >>
> >> 001     diss 34042
> >> 100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
> >> 245     _aDoing being boricua :
> >>
> >> 001     diss 33892
> >> 100 1   _aDavis, Jennifer M.
> >> 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> >>             _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> >>
> >> I would greatly appreciate any suggestions for solving this.
> >> Thank you most kindly.
> >>
> >> Regards,
> >>
> >> --Jackie
> >>
> >> |Jackie Shieh
> >> |Data Loads & Development
> >> |Harlan Hatcher Graduate Library
> >> |University of Michigan
> >> |920 North University
> >> |Ann Arbor, MI 48109-1205
> >> |Phone: 734.763.6070 FAX: 734.615.9788
> >> |E-mail: JShieh [AT] umich [DOT] edu
> >>
> >> <test.mrc>
> 
> --------------------------------------------------
> Brian Sheppard
> University of Wisconsin Digital Collections Center
> [EMAIL PROTECTED]    (608) 262-3349
> 