Re: UNICODE character identification

George Milten Tue, 10 Feb 2015 07:25:14 -0800

Looks good, though I guess it is a deprecated module.

Thank you for the info. I will look further into the machine learning
approach, but I think my use case is simpler: check whether a character
belongs to a certain set (i.e. a script/language), and flag it as odd if that
does not match the language of the rest of the word.
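
A minimal sketch of that per-word check, assuming the data has already been
decoded into Perl character strings. It uses charscript() from the core
Unicode::UCD module to look up the script of each letter. The helper name
odd_characters and the two sample records are made up for illustration, and
the tie-breaking rule is just one possible choice.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);   # core module: code point -> script name

# Hypothetical helper: for one word, return the majority script and a
# reference to a list of [character, script] pairs for the letters that
# belong to some other script.
sub odd_characters {
    my ($word) = @_;
    my (%count, @letters);
    for my $char (split //, $word) {
        next unless $char =~ /\p{L}/;    # letters only: skip digits, punctuation, marks
        my $script = charscript(ord $char) // 'Unknown';
        $count{$script}++;
        push @letters, [$char, $script];
    }
    return (undef, []) unless @letters;
    # Majority script; ties are broken alphabetically so the result is stable
    # (an assumption of this sketch, e.g. Greek wins over Latin on a tie).
    my ($majority) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    my @odd = grep { $_->[1] ne $majority } @letters;
    return ($majority, \@odd);
}

# Made-up sample records: identifier => metadata string.
my %records = (
    rec001 => "Plat\x{03BF} and Socrates",   # "Plato" typed with a Greek omicron
    rec002 => 'Aristotle',
);

binmode STDOUT, ':encoding(UTF-8)';
for my $id (sort keys %records) {
    for my $word (split /\s+/, $records{$id}) {
        my ($majority, $odd) = odd_characters($word);
        for my $hit (@$odd) {
            printf "%s: '%s' in '%s' is %s (word is mostly %s)\n",
                $id, $hit->[0], $word, $hit->[1], $majority;
        }
    }
}
```

With the sample data this should report only the Greek omicron in rec001.
Very short words can still be ambiguous, since a two-letter word with one
letter from each script has no real majority.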


2015-02-10 17:17 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:

> You might also take a machine learning approach, like Naïve Bayesian
> Classification. For instance
> http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes-0.04/lib/Algorithm/NaiveBayes.pm
> You build test sets from records in various scripts and use the classifier
> to find hybrid cases. I have had quite satisfactory results with this approach
> in a slightly different use case.
>
>
>
>
>
> *From:* George Milten [mailto:george.mil...@gmail.com]
> *Sent:* Tuesday 10 February 2015 16:09
>
> *To:* Kool,Wouter
> *Cc:* perl4lib@perl.org
> *Subject:* Re: UNICODE character identification
>
>
>
> Yes, this is probably where I was heading too, but I thought there might be a
> more clever way. Also, is there a good Perl normaliser? I have not had any
> experience with:
>
>
>
> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm
>
>
>
> For starters, if I could spot just the odd letters using the Latin and Greek
> regex character classes, I would be more than happy.
>
>
>
> 2015-02-10 17:04 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:
>
> Apologies, I missed the subject line...
>
> Then you might use the regex character classes. For instance, $text =~
> m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested
> it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you
> find the character class that most characters match and you look for the
> exceptions. Would that help?
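
A small sketch of the character-class idea above, assuming the text is
already a decoded character string. \P{Latin} is Perl's shorthand for
[^\p{Latin}]. Since spaces, digits and punctuation are not Latin either, the
sketch adds a \p{L} lookahead so that only letters are reported. The helper
name non_latin_letters and the sample word are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every letter in $text that is not in the
# Latin script. The lookahead (?=\p{L}) restricts the match to letters.
sub non_latin_letters {
    my ($text) = @_;
    return $text =~ /(?=\p{L})\P{Latin}/g;
}

binmode STDOUT, ':encoding(UTF-8)';

my $word = "Plat\x{03BF}";    # made-up sample: Latin "Plat" + Greek omicron
for my $char (non_latin_letters($word)) {
    printf "non-Latin letter %s (U+%04X)\n", $char, ord($char);
}

# Quick yes/no test for a word that mixes the two scripts.
print "mixes Latin and Greek\n"
    if $word =~ /\p{Latin}/ && $word =~ /\p{Greek}/;
```

This only answers "is it Latin or not"; finding the script that most
characters of a word match still needs a count per script.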
>
>
>
>
>
>
>
> *From:* George Milten [mailto:george.mil...@gmail.com]
> *Sent:* Tuesday 10 February 2015 15:56
> *To:* Kool,Wouter
> *Cc:* perl4lib@perl.org
> *Subject:* Re: UNICODE character identification
>
>
>
> utf-8,
>
>
>
> thank you
>
>
>
> 2015-02-10 16:54 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:
>
> What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
> information matters a lot in determining whether your idea would work. If it
> is in a single-byte encoding, there is often no way to determine the script
> a character belongs to.
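
To illustrate why the encoding matters, here is a minimal sketch assuming the
records arrive as raw UTF-8 bytes. Until the bytes are decoded into
characters (here with decode() from the core Encode module), the script
properties are tested against individual bytes and a Greek letter goes
unnoticed. The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as read from a file without an encoding layer:
# "Plat" followed by Greek small omicron (U+03BF, bytes CE BF).
my $bytes    = "Plat\xCE\xBF";
my $byte_len = length($bytes);

# Undecoded, each byte counts as one character, so no Greek letter is seen.
my $raw_hit = $bytes =~ /\p{Greek}/ ? 1 : 0;

# Decoded, the two bytes CE BF become the single character U+03BF.
my $chars       = decode('UTF-8', $bytes);
my $decoded_hit = $chars =~ /\p{Greek}/ ? 1 : 0;

printf "length: %d bytes, %d characters\n", $byte_len, length($chars);
printf "Greek letter found: before decoding %d, after decoding %d\n",
    $raw_hit, $decoded_hit;
```

Opening the input file with an ':encoding(UTF-8)' layer has the same effect
as the explicit decode() call.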
>
>
>
>
>
> *Wouter Kool*
> Metadata Specialist *·* OCLC B.V.
> Schipholweg 99 *·* P.O. Box 876 *·* 2300 AW Leiden *·* The Netherlands
> t +31-(0)71-524 6500
>
> wouter.k...@oclc.org *·* www.oclc.org
>
>
> *From:* George Milten [mailto:george.mil...@gmail.com]
> *Sent:* Tuesday 10 February 2015 13:27
> *To:* perl4lib@perl.org
> *Subject:* UNICODE character identification
>
>
>
> Hello friendly folks,
>
>
>
> Here is what I am trying to do; I am looking for your help to find the
> cleverest way to achieve it:
>
> We have records that include typos like this: take a word, say Plato,
> where the last o was typed with the keyboard set to Greek. We need
> something that would parse all the metadata character by character,
> determine the script that the majority of the characters in each word
> belong to, and return the odd characters, the script they belong to, and
> the identifier of the record they were found in, so that we can correct
> them.
>
>
>
> thank you in advance
>
>
>
>
>
