On 3/27/07, Zoltán Németh <[EMAIL PROTECTED]> wrote:
On Tuesday, 2007-03-27 at 15:06, William Lovaton wrote:
> Hi there,
>
> I am trying to implement language detection with PHP for a web site I am
> trying to build.  The idea is to take a piece of text and try to guess
> the language it is written in.
>
> I have two options but I'd like to know if you guys have a better idea.
>
> 1) I implemented a detector using spell checking: if I run the text
> through several spell checkers, the one that reports the fewest errors
> is probably the right language for that text.  It works quite well and
> I am pleased with it.  The only thing I don't like is that loading many
> spell checkers is a bit of a waste: it may require a lot of CPU and
> memory, depending on the size and number of dictionaries you load.
> Besides, it adds one extra module dependency (pspell).
>
> 2) The other option is implemented in PEAR and it's called
> Text_LanguageDetect:
> http://pear.php.net/package/Text_LanguageDetect
>
> It seems to use a very different technique called N-Gram-Based Text
> Categorization.  I haven't tested it yet, but I will very soon to see
> how well it works.  It says it's in alpha state, but I guess it doesn't
> require pspell, doesn't consume a lot of memory, and should be fast.
> The only thing I am worried about is how accurate it is.  I'll check
> soon and post my comments later.
>
> 3) <Insert a very good idea here, please>
>
> I'd really like to hear what different alternatives all of you have for
> this problem.
>
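
For readers who have not met N-gram-based text categorization: the usual idea is to rank the most frequent character n-grams of a text and compare that ranking against one stored ranking per language. Below is a minimal sketch in Python (chosen only for illustration, the logic ports to PHP directly). It is not the Text_LanguageDetect code; the function names, the `top` cutoff, and the tiny training sentences are all assumptions made up for this example.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return the text's character n-grams (length 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def detect_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training samples (assumption): real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs after the fox through the woods",
    "es": "el zorro salta sobre el perro perezoso y luego "
          "el perro corre tras el zorro por el bosque",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(detect_language("the dog and the fox", PROFILES))
    print(detect_language("el perro y el zorro", PROFILES))
```

The per-language profiles can be computed once and cached, so at detection time only the input text needs to be profiled. Accuracy on short inputs depends heavily on how much training text stands behind each profile.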

I have no experience with this problem, so I'm just guessing ;)

What if you build some arrays of language-specific markers and check for
those? I mean you could store rules like "if it contains 's, 've, or 'm
many times, it's probably English". I don't really know how to store
those rules, and I'm not sure they are good enough (or whether good
enough rules even exist) to tell several languages apart.
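
One possible way to store such rules is a plain map from language to a list of marker substrings, scored by counting occurrences. This is only a rough sketch in Python (for illustration, it maps directly onto PHP arrays). The English markers 's, 've and 'm come from the suggestion above; every other marker, the German entry, and the function names are hypothetical and not vetted for accuracy.

```python
# Hypothetical marker lists: only 's, 've and 'm come from the post above.
MARKERS = {
    "english": ["'s ", "'ve ", "'m ", " the ", " and "],
    "german": [" der ", " und ", " nicht ", "sch"],
}

def score_markers(text, markers=MARKERS):
    """Count how often each language's markers occur in the text."""
    padded = " " + text.lower() + " "  # so word markers match at the edges
    return {lang: sum(padded.count(m) for m in patterns)
            for lang, patterns in markers.items()}

def guess_language(text, markers=MARKERS):
    """Return the best-scoring language, or None if no marker matched."""
    scores = score_markers(text, markers)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(guess_language("I've heard it's fine and I'm happy"))
    print(guess_language("das ist nicht schlecht und der hund ist alt"))
```

A scheme like this is cheap, but it only works as well as the marker lists, and closely related languages will share many markers.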

greets
Zoltán Németh

In formal English you're not supposed to use 've, 'm, etc.; "I'm" should
be written as "I am". So I don't think that's going to work.
But words like "and" are really English, I think :)
Keep in mind that this is quite a hard approach, but I don't have a
better solution.
Just as an example, Dutch and Afrikaans are not very different, so it's
really hard to tell which of the two a text is written in.

Tijnema

P.S. If you can't tell the difference between Dutch and Afrikaans, guess
Dutch :) It's used a lot more than Afrikaans.
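
To make the "common words" idea and the Dutch-over-Afrikaans fallback concrete, here is a small Python sketch (illustration only, easy to port to PHP). The word lists are deliberately tiny and hypothetical, and the preference order beyond "Dutch before Afrikaans" is an assumption of this example, not something stated in the thread.

```python
import re

# Hypothetical, deliberately tiny lists of very common words per language.
FUNCTION_WORDS = {
    "english": {"the", "and", "is", "of", "to"},
    "dutch": {"de", "het", "en", "is", "van", "een"},
    "afrikaans": {"die", "en", "is", "van", "nie"},
}

# Tie-break order: on equal scores the earlier language wins.
# Only "dutch before afrikaans" comes from the post; the rest is assumed.
PREFERENCE = ["english", "dutch", "afrikaans"]

def guess_by_function_words(text):
    """Score each language by how many of its common words appear in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in FUNCTION_WORDS.items()}
    best = max(scores.values())
    if best == 0:
        return None  # nothing recognized, no guess
    for lang in PREFERENCE:
        if scores[lang] == best:
            return lang

if __name__ == "__main__":
    print(guess_by_function_words("the cat and the dog"))
    print(guess_by_function_words("dit is van en"))  # tie, falls back to Dutch
```

With lists this short, Dutch and Afrikaans tie often, which is exactly the situation the fallback rule is meant for.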


> Thanks a lot,
>
>
> -William
>

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



