Hi Henri,

> Is there a reason why MARC::File::XML considers only a very 
> strict subset of utf-8 as valid ?

I would guess that it has to do with adhering to the MARC-21 repertoire of 
characters, so as to facilitate the round-trip conversion between the MARC-8 
and Unicode character sets [1,2].  At some point in the future the MARC-21 
repertoire will be decoupled from what was defined for MARC-8.

> For instance no linebreak...

Control characters such as line breaks are a bit of a different issue.  The 
MARC-21 standard currently allows for only a handful of control characters, not 
including (as you have discovered) the line break [3].
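
As a rough illustration of that restriction, here is a minimal Python sketch (not part of MARC::File::XML, and not how that module works internally) that flags control characters outside the list in footnote [3]. The names ALLOWED_CONTROLS and find_disallowed_controls are hypothetical, and "control character" is assumed here to mean Unicode general category Cc:

```python
import unicodedata

# Unicode code points of the control characters listed in footnote [3].
# U+200D and U+200C are format characters (category Cf), so the check
# below never flags them; they are listed only for completeness.
ALLOWED_CONTROLS = {
    0x001B,  # ESCAPE
    0x001D,  # RECORD TERMINATOR
    0x001E,  # FIELD TERMINATOR
    0x001F,  # SUBFIELD DELIMITER
    0x0098,  # NON-SORT BEGIN
    0x009C,  # NON-SORT END
    0x200D,  # JOINER
    0x200C,  # NON-JOINER
}


def find_disallowed_controls(text):
    """Return (index, 'U+XXXX') for each Cc character not in ALLOWED_CONTROLS."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cc" and ord(ch) not in ALLOWED_CONTROLS:
            hits.append((i, "U+%04X" % ord(ch)))
    return hits


if __name__ == "__main__":
    # A line break inside a subfield value is reported as U+000A.
    print(find_disallowed_controls("Nicolas\nJérôme"))  # [(7, 'U+000A')]
```

Running something like this over the subfield values before parsing would at least show which characters are candidates for being dropped.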

> This could be a really BIG trouble for kanjis or hindu languages imho.

The MARC-21 repertoire of characters includes East Asian Ideographs (Han), 
Japanese Hiragana and Katakana, and Korean Hangul [4,5].  I don't believe that 
Indic scripts in the vernacular would be valid MARC-21 characters. 

Are you finding any cases where the MARC::File::XML parser is dropping valid 
MARC-21 characters?

-- Michael

[1] USMARC Character Set Issues and Mapping to Unicode/UCS
    http://www.loc.gov/marc/marbi/1996/96-10.html

  WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM
  USMARC TO UNICODE/UCS

  The following Working Principles were established by the
  Subcommittee and continue to inform their mapping decisions:

   * Round-trip mapping will be provided between USMARC characters
     and Unicode/UCS characters wherever possible.

[2] MARC 21 Specifications > CHARACTER SETS: Part 2 UCS/Unicode Environment
    http://www.loc.gov/marc/specifications/speccharucs.html
    "The specifications are built around enabling round trip movement of MARC
     data between MARC-8 and UCS/Unicode with as little loss as possible."

[3]
  MARC-8  Unicode  Character
  ------  -------  ---------
   0x1B   U+001B   ESCAPE
   0x1D   U+001D   RECORD TERMINATOR 
   0x1E   U+001E   FIELD TERMINATOR 
   0x1F   U+001F   SUBFIELD DELIMITER 
   0x88   U+0098   NON-SORT BEGIN 
   0x89   U+009C   NON-SORT END 
   0x8D   U+200D   JOINER 
   0x8E   U+200C   NON-JOINER 

[4] MARC 21 Specifications > CHARACTER SETS: Part 3 Code Tables
    http://www.loc.gov/marc/specifications/specchartables.html

[5] MARC 21 Standard - UCS/Unicode Environment > Character Set Mappings 
    http://rocky.uta.edu/doran/charsets/marcU.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, September 26, 2007 10:45 AM
> To: perl4lib
> Subject: MARC::File::XML and parsing.
> 
> Hi,
> I have some problems with the MARC::File::XML parser.
> 
> Take these two XML records.
> I agree that there are odd characters in some subfields.
> Still, since those characters are UTF-8, I am wondering why 
> MARC::File::XML should drop them when parsing.
> Is there a reason why MARC::File::XML considers only a very 
> strict subset of utf-8 as valid ? (For instance no linebreak, 
> no ...) ?
> 
> Could it not say "OK, it is an XML record, encoded in UTF-8, I take 
> it for granted, no matter if there are 'odd' characters"?
> This could be a really BIG trouble for kanjis or hindu languages imho.
> 
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <record
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xsi:schemaLocation="http://www.loc.gov/MARC21/slim 
> http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>  xmlns="http://www.loc.gov/MARC21/slim">
> 
>  <leader>00150nx  a2200073   4500 </leader>
>  <datafield tag="200" ind1=" " ind2="1">
>    <subfield code="a">Nicolas</subfield>
>    <subfield code="b">Jérôme</subfield>
>    <subfield code="4">Traducteur</subfield>
>  </datafield>
>  <datafield tag="100" ind1=" " ind2=" ">
>    <subfield code="a">19980124afrey50      ba0</subfield>
>  </datafield>
>  <controlfield tag="001">3568</controlfield>
>  <datafield tag="152" ind1=" " ind2=" ">
>    <subfield code="b">NP</subfield>
>  </datafield>
> </record>
> <?xml version="1.0" encoding="UTF-8"?>
> <record
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xsi:schemaLocation="http://www.loc.gov/MARC21/slim 
> http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>  xmlns="http://www.loc.gov/MARC21/slim">
> 
>  <leader>00151nx  a2200073   4500 </leader>
>  <datafield tag="200" ind1=" " ind2="1">
>    <subfield code="a">Guynemer</subfield>
>    <subfield code="b">Georges</subfield>
>    <subfield code="f">(1894-1917)</subfield>
>  </datafield>
>  <datafield tag="100" ind1=" " ind2=" ">
>    <subfield code="a">19980129afrey50      ba0</subfield>
>  </datafield>
>  <controlfield tag="001">4642</controlfield>
>  <datafield tag="152" ind1=" " ind2=" ">
>    <subfield code="b">NP</subfield>
>  </datafield>
> </record>
> 
> --
> Henri Damien LAURENT et Paul POULAIN
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
> 
> 
