RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That 
information matters a lot in determining whether your idea would work. If it is 
in a single-byte encoding, there is often no way to determine the script a 
character belongs to.


Wouter Kool
Metadata Specialist · OCLC B.V.
Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
t +31-(0)71-524 6500
wouter.k...@oclc.org · www.oclc.org
Follow @OCLC_NL on Twitter: https://twitter.com/OCLC_NL
Follow OCLC (Nederland) on LinkedIn: https://www.linkedin.com/company/oclc-nederland-
Subscribe to OCLCVideo: https://www.youtube.com/playlist?list=PLWXaAShGazu4t2h02aeXBFJO4MecNWSMO





From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday 10 February 2015 13:27
To: perl4lib@perl.org
Subject: UNICODE character identification

Hello friendly folks,

Here is what I am trying to do, and I am looking for your help in finding the 
cleverest way to achieve it:

We have records that include typos like this: take a word such as Plato, where 
the last o was typed with the keyboard set to Greek. We need something that 
parses all metadata character by character, determines which script the 
majority of the characters in each word belong to, and returns the odd 
characters, the script they belong to, and the identifier of the record they 
were found in, so that we can correct them.

Thank you in advance
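
The request above can be sketched roughly as follows. This is only one possible reading of it, not a tested solution from the thread: it assumes the text has already been decoded from UTF-8 into Perl characters, it uses charscript() from the core Unicode::UCD module, and the function name find_odd_chars and the record identifier 'rec-001' are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Return one hashref per out-of-place character found in $text.
# $record_id is a hypothetical identifier supplied by the caller.
sub find_odd_chars {
    my ($record_id, $text) = @_;
    my @report;
    for my $word ($text =~ /(\pL+)/g) {          # each run of letters
        my @chars = split //, $word;
        my %count;
        $count{ charscript(ord $_) }++ for @chars;
        next if keys(%count) < 2;                # single-script word
        # Majority script of this word (ties are resolved arbitrarily).
        my ($majority) = sort { $count{$b} <=> $count{$a} } keys %count;
        for my $i (0 .. $#chars) {
            my $script = charscript(ord $chars[$i]);
            next if $script eq $majority;
            push @report, {
                record   => $record_id,
                word     => $word,
                position => $i,
                char     => $chars[$i],
                script   => $script,
                expected => $majority,
            };
        }
    }
    return @report;
}

binmode STDOUT, ':encoding(UTF-8)';
# The sample word ends in U+03BF GREEK SMALL LETTER OMICRON.
for my $hit (find_odd_chars('rec-001', "Plat\x{03BF} wrote dialogues")) {
    printf "%s: U+%04X (%s) in '%s', expected %s\n",
        $hit->{record}, ord $hit->{char}, @$hit{qw(script word expected)};
}
```

Words are taken to be runs of letters (\pL+), so a decomposed accent would split a word; normalising to NFC first may help there. Short words with an even split between scripts have no real majority and probably need manual review.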


Re: UNICODE character identification

2015-02-10 Thread George Milten
Yes, this is probably where I was heading as well, but I thought there might
be a cleverer way. Also, is there a good Perl normaliser? I have no
experience with:

http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm

For starters, if I could spot just the odd letters between the Latin and
Greek regex character classes, I would be more than happy.
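
A minimal sketch of that "for starters" case, assuming decoded character strings as input. Unicode::Normalize ships with Perl and exports NFC; the helper names mixed_latin_greek_words and odd_letters are invented here for illustration only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Return the words in $text that contain both Latin and Greek characters.
sub mixed_latin_greek_words {
    my ($text) = @_;
    # Compose base letter + combining accent pairs where a precomposed
    # character exists, so accents are not counted as separate characters.
    $text = NFC($text);
    return grep { /\p{Latin}/ && /\p{Greek}/ } $text =~ /([\pL\pM]+)/g;
}

# Return the characters of the minority script in a mixed word.
sub odd_letters {
    my ($word) = @_;
    my $latin = () = $word =~ /\p{Latin}/g;    # number of Latin characters
    my $greek = () = $word =~ /\p{Greek}/g;    # number of Greek characters
    my $minority = $latin >= $greek ? qr/\p{Greek}/ : qr/\p{Latin}/;
    return $word =~ /($minority)/g;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (mixed_latin_greek_words("Plat\x{03BF} and Socrates")) {
    printf "%s: odd character U+%04X\n", $word, ord $_ for odd_letters($word);
}
```

Normalising first matters mainly for accented Greek and Latin letters that arrive in decomposed form; it does not change which script a base letter belongs to.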

2015-02-10 17:04 GMT+02:00 Kool,Wouter wouter.k...@oclc.org:

  Apologies, I missed the subject line...

 Then you might use the regex character classes. For instance, $text =~ 
 m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested 
 it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you 
 find the character class that most characters match and you look for the 
 exceptions. Would that help?







Re: UNICODE character identification

2015-02-10 Thread George Milten
Looks good, though I guess it is a deprecated module.

Thank you for the info all the same. I will look further into the machine
learning approach, but I guess my use case is simpler: check whether a
character belongs to a certain set (that is, a script or language) and decide
whether it is odd, based on the language of the word.

2015-02-10 17:17 GMT+02:00 Kool,Wouter wouter.k...@oclc.org:

  You might also take a machine learning approach, such as Naïve Bayesian
 classification. For instance:
 http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes-0.04/lib/Algorithm/NaiveBayes.pm
 You build training sets from records in various scripts and use the
 classifier to find hybrid cases. I have had quite satisfactory results with
 this approach in a slightly different use case.





Re: UNICODE character identification

2015-02-10 Thread George Milten
UTF-8,

thank you

2015-02-10 16:54 GMT+02:00 Kool,Wouter wouter.k...@oclc.org:

  What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
 information matters a lot in determining whether your idea would work. If it
 is in a single-byte encoding, there is often no way to determine the script
 a character belongs to.





RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
Apologies, I missed the subject line...

Then you might use the regex character classes. For instance, $text =~ 
m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested 
it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you 
find the character class that most characters match and you look for the 
exceptions. Would that help?
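
A small sketch of how these properties behave, assuming the text is a decoded character string; the sample string is made up. One thing worth noting is that spaces, digits and punctuation are not Latin either, so a bare "non-Latin" match flags them too. Restricting the match to letters avoids that.

```perl
use strict;
use warnings;

# Sample text; the fifth character is U+03BF GREEK SMALL LETTER OMICRON.
my $text = "Plat\x{03BF} 427 BC";

# \P{Latin} is equivalent to [^\p{Latin}]. It matches the omicron, but
# also the spaces and digits, which belong to the Common script.
my @non_latin = $text =~ /(\P{Latin})/g;

# Requiring a letter first (\pL) leaves only the out-of-place omicron.
my @non_latin_letters = $text =~ /((?=\pL)\P{Latin})/g;

# \p{Hiragana} matches Hiragana characters such as U+3072 and U+3089.
my $has_hiragana = "\x{3072}\x{3089}" =~ /\p{Hiragana}/ ? 1 : 0;

printf "non-Latin characters: %d, non-Latin letters: %d, Hiragana found: %d\n",
    scalar @non_latin, scalar @non_latin_letters, $has_hiragana;
```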




From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday 10 February 2015 15:56
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification

utf-8,

thank you
