Re: [CODE4LIB] find more like this one

2005-05-24 Thread Mike Rylander
On 5/24/05, Eric Lease Morgan [EMAIL PROTECTED] wrote:
 On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:

  I did a search on indigenous.  The first item was a French article.
  The display of diacritics was messed up.  I added French to the
  languages in IE, but the display was still bad.  I don't know if this
  is a WinXP problem or a problem with your page.  I did not see a
  language encoding on your source.  Perhaps UTF-8 will fix this?  Or it
  may be a problem from the document retrieved.

 Yes, I do not know how to handle the extended ASCII characters, and I am
 hoping someone here can point me in the right direction.

 As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
 use MyLibrary to save the data to a MySQL database. I then write
 reports against the database in the form of a simple XML stream and
 feed the stream to swish-e for indexing. I know swish-e is unable to
 index multi-byte characters, and search results come directly from
 swish-e, not MyLibrary.


Will swish-e index the actual bytes of non-diacritic multibyte
characters?  If so, you can do what we do with Open-ILS (we use
Postgres' tsearch2 full-text indexing module).  When indexing data, we
strip it of diacritical combining characters using 's/\p{M}//go'.
When a search is submitted we do the same thing, because a linked
search may contain the diacritics, or the searching user may be typing
in a non-US locale.  This searches the simplified strings and does
the right thing, at least with our data.  We display the original
document (or a portion thereof) so that the multibyte characters are
displayed correctly.

For scripts that are entirely outside ASCII (Arabic, Kanji, etc.) we
just index and search using the original bytes, because they are not
matched by /\p{M}/.  In our testing this seems to work fine (of
course, we'd appreciate any tips on making this smarter).
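
In case it helps, here is a minimal sketch of that normalization step in
Perl (Unicode::Normalize ships with Perl 5.8; the NFD() call is an addition
of mine so that precomposed characters lose their accents too, and
strip_diacritics is just an illustrative name, not anything from our code):

    use strict;
    use warnings;
    use utf8;                        # the source below contains literal UTF-8
    use Unicode::Normalize qw(NFD);

    binmode STDOUT, ':utf8';

    # Decompose to NFD so precomposed characters become base + combining
    # marks, then strip the marks (\p{M}) before indexing or searching.
    sub strip_diacritics {
        my ($text) = @_;
        my $stripped = NFD($text);
        $stripped =~ s/\p{M}//g;
        return $stripped;
    }

    print strip_diacritics('français'), "\n";   # prints "francais"
    print strip_diacritics('日本語'), "\n";      # unchanged: no combining marks

Run the same function over both the indexed data and the submitted search
terms and the two sides stay comparable.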

 Maybe I should draw search results from MyLibrary and not swish-e to
 display characters correctly? If I draw content from many global
 sources, then how do I know what character set to use for display?


This is definitely the best thing to do.  Search the normalized data
and display the original.  Also, if you store the documents UTF-8
encoded, you won't need to worry about the character set; you just need
to set the encoding for the page to UTF-8 and the browser will take
care of the rest.
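
For the display side, this is roughly all it takes (a toy CGI script; the
markup and wording are made up, but the header and meta tag are the
important parts):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                        # literal UTF-8 in the source below

    binmode STDOUT, ':utf8';

    # Declare UTF-8 in the HTTP header and repeat it in a meta tag; the
    # bytes we print must actually be UTF-8 for the declaration to hold.
    print "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    print qq{<html><head>\n},
          qq{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n},
          qq{<title>Search results</title></head>\n},
          qq{<body><p>Displayed as stored: français</p></body></html>\n};

As long as everything you store is UTF-8, the same page template works for
every language in the collection.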

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: [CODE4LIB] find more like this one

2005-05-24 Thread Binkley, Peter
Eric and Mike wrote:

  Maybe I should draw search results from MyLibrary and not
 swish-e to
  display characters correctly? If I draw content from many global
  sources, then how do I know what character set to use for display?
 

 This is definitely the best thing to do.  Search the
 normalized data and display the original.  Also, if you
 store the documents UTF-8 encoded, you won't need to worry
 about the character set; you just need to set the encoding
 for the page to UTF-8 and the browser will take care of the rest.


Bear in mind that even in UTF-8 there is more than one way to encode an
accented character. It can be precomposed (using a single character,
e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
decomposed (using a base character and a non-spacing diacritic, e.g.
U+0065 and U+0301, lower-case e plus the acute accent: this is
normalization form D). If you're searching at the byte level, you have
to be sure that your index and your search term have been normalized the
same way or they won't match. I've found this FAQ useful for this stuff:
http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
(including stripping accents and normalizing case for different scripts)
for indexing and searching in UTF-8. There's also a C API, which could
presumably be incorporated into a Perl process, but no doubt there are
similar native Perl tools.
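
Since Perl is the context here, a quick illustration of the two forms
(Unicode::Normalize is a core module; the "equal"/"not equal" output is
the whole point):

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC NFD);

    my $precomposed = "\x{00E9}";            # U+00E9, e-acute (form C)
    my $decomposed  = "\x{0065}\x{0301}";    # U+0065 + U+0301 (form D)

    # The strings are canonically equivalent but not byte-identical.
    print $precomposed eq $decomposed ? "equal\n" : "not equal\n";            # not equal

    # Normalize both sides to the same form and they compare as expected.
    print NFC($precomposed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # equal
    print NFD($precomposed) eq NFD($decomposed) ? "equal\n" : "not equal\n";  # equal

Which form you pick matters less than picking one and applying it to both
the index and the query.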

In general I think we've got to include i18n from the beginning: pay
attention to character sets of incoming data, normalize as early in the
process as possible (especially if ANSEL is involved!), use
UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
(this site helps with the HTML:
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
not as easy as it ought to be but at least there are good open-source
tools out there.

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]


Re: [CODE4LIB] find more like this one

2005-05-24 Thread Andrew Nagy

Binkley, Peter wrote:


Bear in mind that even in UTF-8 there is more than one way to encode an
accented character. It can be precomposed (using a single character,
e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
decomposed (using a base character and a non-spacing diacritic, e.g.
U+0065 and U+0301, lower-case e plus the acute accent: this is
normalization form D). If you're searching at the byte level, you have
to be sure that your index and your search term have been normalized the
same way or they won't match. I've found this FAQ useful for this stuff:
http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
(including stripping accents and normalizing case for different scripts)
for indexing and searching in UTF-8. There's also a C API, which could
presumably be incorporated into a Perl process, but no doubt there are
similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay
attention to character sets of incoming data, normalize as early in the
process as possible (especially if ANSEL is involved!), use
UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
(this site helps with the HTML:
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
not as easy as it ought to be but at least there are good open-source
tools out there.



Wow, it looks like there are some Unicode experts in our midst.  I am in
the middle of developing an international bibliographic database where
most of the titles are in languages other than EN-US.

Our database will store citations entered via a web form, since the
bibliography currently exists only in card format.  I am using MySQL 4
because of its Unicode support and collations.  I normally use Postgres,
but I figured that for a database that will mainly be used for searching
(very few writes after the data has been populated) I'd give MySQL a try.

One feature we would like to offer is searching via the collations.  For
example, if I enter the phrase francais, I would hope that any items
with the term français would be returned.  Is it correct to use MySQL's
collations for this?  Does anyone have experience with this?
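
To make the question concrete, this is roughly what I have in mind (the
table, column, and connection details are made up; my understanding is
that utf8_general_ci treats accented and unaccented letters as equal when
comparing, which is what would make francais match français):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical connection details and schema.
    my $dbh = DBI->connect('DBI:mysql:database=biblio;host=localhost',
                           'user', 'secret', { RaiseError => 1 });
    $dbh->do('SET NAMES utf8');      # send/receive UTF-8 over the connection

    $dbh->do(q{
        CREATE TABLE citations (
            id    INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci
        )
    });

    # With an accent-insensitive collation on the column, a search for
    # 'francais' should also match rows whose titles contain 'français'.
    my $rows = $dbh->selectall_arrayref(
        q{SELECT title FROM citations WHERE title LIKE ?},
        undef, '%francais%'
    );

If that is not what the collations are meant for, I'd be glad to hear the
right way to do it.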

I am still learning the ins and outs of UTF-8, so I am glad there are so
many people on this list who know so much about it!

Andrew