Hi,

In fact, the question is quite complex, and I'm not sure I can explain it well.

At 14.57 16/12/03, you wrote:

This process works just fine for records that contain no diacritics, but
when diacritics are in the records extra characters end up in my saved
files, like this:

  00901nam  22002651
    ^^^
  45000010008000000050017000080080041000250350021000669060045000870
  10001700132040001800149050001800167082001000185100002900195245009
  20022426000340031630000470035049000290039750400260042660000340045
  27100021004869910044005079910055005510990029006063118006
  19740417000000.0731207s1967    nyuabf   b    000 0beng
  9(DLC)   67029856  a7bcbccorignewdueocipf19gy-gencatlg
  a   67029856   aDLCcDLCdDLC00aND588.D9bR8500a759.31
  aRussell, Francis,d1910-14aThe world of Dˆ®urer,
                                              ^^^^^^^
  1471-1528,cby Francis Russell and the editors of Time-Life
  Books.  aNew York,bTime, inc.c[1967]  a183 p.billus.,
  maps, col. plates.c32 cm.0 aTime-Life library of art
  aBibliography: p. 177.10aDˆ®urer, Albrecht,d1471-1528.2
                              ^^^^^^^
  aTime-Life Books.  bc-GenCollhND588.D9iR85tCopy 1wBOOKS
  bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
  arussell-world-1071495663

Notice how Dürer got munged into Dˆ®urer, twice, and consequently the record
length is not 901 but 903 instead.

Some people say I must be sure to request a specific character set from the
LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS. Which
one of these character sets do I want and how do I tell the remote database
which one I want?

1) When you query LOC without requesting a specific character set, you receive data in the MARC-8 character set.


2) In the MARC-8 character set, a letter like "è" [e grave] is stored as TWO bytes: one for the diacritic [the grave accent] and one for the base letter [the letter e].
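
To make point 2 concrete, here is a minimal sketch in Python (not Perl, and not a real MARC-8 decoder). It assumes the MARC-8 (ANSEL) convention that the diacritic byte comes before the base letter, and it knows only two diacritics: 0xE1 (grave) and 0xE8 (umlaut). The function name and the table are made up for illustration, and every other byte is assumed to be plain ASCII.

```python
import unicodedata

# Only two MARC-8 (ANSEL) combining diacritics, for illustration.
# A real decoder needs the full MARC-8 code tables.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE8: "\u0308",  # umlaut / diaeresis
}

def decode_marc8_sketch(data: bytes) -> str:
    """Decode ASCII plus the two diacritics above into a Unicode string."""
    out = []
    pending = []  # diacritics seen but not yet attached to a base letter
    for b in data:
        if b in MARC8_COMBINING:
            # MARC-8 puts the diacritic BEFORE its base letter.
            pending.append(MARC8_COMBINING[b])
        else:
            # Assumption: anything else is a plain ASCII character.
            out.append(chr(b))
            # Unicode puts combining marks AFTER the base letter.
            out.extend(pending)
            pending.clear()
    # Compose base letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", "".join(out))

e_grave = bytes([0xE1, 0x65])            # grave accent, then "e"
durer = b"D" + bytes([0xE8]) + b"urer"   # umlaut, then "u"

print(len(e_grave), decode_marc8_sketch(e_grave))
print(len(durer), decode_marc8_sketch(durer))
```

The two-byte sequence decodes to a single letter, so the byte count of the raw data is one higher than the number of letters a reader sees.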

3) In the leader, positions 0-4 give the number of characters, NOT the number of bytes. Your record contains 901 characters but 903 bytes.
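
A quick way to check a saved record is to compare the length declared in leader positions 0-4 with the number of bytes actually present. The following is a minimal sketch in Python (not Perl); the function names and the toy record are hypothetical, and only the leader fragment "00901nam  22002651" comes from the record quoted above.

```python
def declared_length(record: bytes) -> int:
    """Return the record length stated in leader positions 0-4."""
    return int(record[0:5].decode("ascii"))

def extra_bytes(record: bytes) -> int:
    """Return how many bytes the record has beyond its declared length."""
    return len(record) - declared_length(record)

# Leader start taken from the quoted record: it declares 901.
leader_start = b"00901nam  22002651"
print(declared_length(leader_start))

# Toy record (made up): 5 length digits + 7 data bytes = 12 bytes.
toy = b"00012" + b"abcdefg"
# The same toy record after two stray bytes have crept in.
munged = toy + b"xx"

print(extra_bytes(toy), extra_bytes(munged))
```

A non-zero result from extra_bytes is a sign that the saved bytes no longer match what the leader declares, as in the 901 versus 903 case.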

In fact, Perl's "length" function counts the number of bytes. The best option, for now, is to use a character set where 1 character is always 1 byte, for example ISO 8859-1.
A good place to learn about character sets is http://www.gymel.com/charsets/ [in German].
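
To illustrate why a one-byte-per-character set such as ISO 8859-1 keeps character counts and byte counts in step, here is a small sketch in Python (not Perl). The helper name byte_lengths is made up for this example; the encoding names are standard Python codec names.

```python
def byte_lengths(text: str) -> dict:
    """Count characters and encoded bytes for one string."""
    return {
        "characters": len(text),
        "iso-8859-1 bytes": len(text.encode("iso-8859-1")),
        "utf-8 bytes": len(text.encode("utf-8")),
    }

# "Dürer", written with an escape so the example does not depend
# on the encoding of this message.
counts = byte_lengths("D\u00fcrer")
for label, n in counts.items():
    print(label, n)
```

In ISO 8859-1 the byte count equals the character count, while in UTF-8 the "ü" takes two bytes, so the two counts drift apart by one for each such letter.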


Bye

Zeno Tajoli
[EMAIL PROTECTED]
CILEA - Segrate (MI)
02 / 26995321


