On 9/28/09 7:07 AM, "Merlin Morgenstern" <merli...@fastmail.fm> wrote:

> Ashley Sheridan wrote:
>> On Mon, 2009-09-28 at 12:27 +0200, Merlin Morgenstern wrote:
>>> Hi there,
>>> I am trying to find out similarity between 2 strings. Somehow the
>>> similar_text function returns 33% similarity on strings that are not
>>> even close and on the other hand it returns 21% on strings that have a
>>> matching word.
>>> E.G:
>>> 'gemütliche sofas'
>>> Wohngemeinschaften - similarity: 33.333333333333
>>> Sofas & Sessel - similarity: 31.25
>>> I am using this code:
>>> similar_text($data['txt'], $categories[$i], $similarity);
>>> Does anybody have an idea why it gives back 33% similarity on the first
>>> string?
>>> Thank you for any help,
>>> Merlin
>> If you think about it, it makes sense.
>> Taking your three sentences above, 'Wohngemeinschaften' has more
>> characters similar towards the start of the string (you only have to go
>> 4 characters in to start a match) whereas 'sofas' won't match the source
>> string until the 12th character in. Also, both test strings have the same
>> number of characters that match in order, although the ones that match
>> in 'Wohngemeinschaften' are separated by characters that do not match,
>> so I'm not sure what bearing this will have.
>> As noted on the manual page for this function, the similar_text()
>> function compares without regard to string length, and tends to only
>> really be accurate enough for larger excerpts of text.
>> Thanks,
>> Ash
>> http://www.ashleysheridan.co.uk
> Sounds logical. Is there another function you suggest? I guess this is a
> standard problem I am having here. I tried it with levenshtein, but
> similar results.
> e.g. levenshtein (smaller = better):
> Search for : Stellplatz für Wohnwagen gesucht
> Stereoanlagen : 23
> Wohnwagen, -mobile : 24
> Sonstiges für Baby & Kind - : 25
> Steuer & Finanzen - : 25
> How come stereoanlagen and the others shows up here?
> Any idea how I could make this more accurate?
> Thank you for any help, Merlin

as ashley pointed out, it's not a trivial problem.

if you are performing the tests against strings in a db table then a full
text index might help. see, e.g.:

you could also check out the php sphinx client

if you are writing your own solutions and using utf8, take care with
similar_text() or levenshtein(). they are not designed for multibyte
strings: both compare byte by byte, so with utf8 they will probably report
bigger differences than you might expect. i wrote my own limited
damerau-levenshtein function for utf8.
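
to illustrate the byte vs. character point, here is a minimal sketch of a
character-based levenshtein. it is not the function i mentioned above (no
damerau transpositions), and the name utf8_levenshtein() is made up for this
example. it assumes the input is valid utf8 and that the source file itself
is saved as utf8.

```php
<?php
// Hypothetical helper, not part of PHP: Levenshtein distance counted in
// UTF-8 characters instead of bytes.
function utf8_levenshtein(string $a, string $b): int
{
    // The "u" modifier makes PCRE treat the input as UTF-8, so each array
    // element is one character, however many bytes it occupies.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($s === false || $t === false) {
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    $n = count($s);
    $m = count($t);

    // Distances from the empty prefix of $a to every prefix of $b.
    $prev = range(0, $m);

    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}

// "ü" is two bytes in UTF-8, so the built-in counts it as two edits.
echo levenshtein("für", "fur"), "\n";      // 2 (byte-based)
echo utf8_levenshtein("für", "fur"), "\n"; // 1 (character-based)
```

this keeps only two rows of the distance table in memory, which is enough
when you just need the number and not the edit path.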

even if you're using a single byte encoding, i would guess they ignore a
locale's collation. so say you set a german locale, ü will be regarded as
different from both u and ue. again, if you are searching against
strings in a db table, the dbms may understand collations properly.
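
if you stay in php, one rough workaround is to fold the strings yourself
before comparing. the sketch below is only an assumption about what might
help here, not a real collation: fold_german() and best_word_distance() are
made-up names, the umlaut table covers german only, the mbstring extension
is assumed to be loaded, and the source file is assumed to be utf8. comparing
word by word instead of whole strings is my own suggestion for the
"matching word" case from the original question.

```php
<?php
// Hypothetical helper: lowercase the string and replace German umlauts
// with their ASCII digraphs, so "ü" and "ue" end up identical.
function fold_german(string $s): string
{
    $s = mb_strtolower($s, 'UTF-8');
    return strtr($s, ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss']);
}

// Hypothetical scoring: the smallest Levenshtein distance between any word
// of the query and any word of the candidate (0 means a shared word).
// Anything outside a-z and 0-9 after folding is treated as a separator.
// Returns PHP_INT_MAX if either string contains no words.
function best_word_distance(string $query, string $candidate): int
{
    $qWords = preg_split('/[^a-z0-9]+/', fold_german($query), -1, PREG_SPLIT_NO_EMPTY);
    $cWords = preg_split('/[^a-z0-9]+/', fold_german($candidate), -1, PREG_SPLIT_NO_EMPTY);

    $best = PHP_INT_MAX;
    foreach ($qWords as $q) {
        foreach ($cWords as $c) {
            // Safe to use the byte-based built-in: the words are ASCII now.
            $best = min($best, levenshtein($q, $c));
        }
    }
    return $best;
}

$query = 'Stellplatz für Wohnwagen gesucht';
foreach (['Stereoanlagen', 'Wohnwagen, -mobile'] as $category) {
    echo $category, ' : ', best_word_distance($query, $category), "\n";
}
```

with this scoring 'Wohnwagen, -mobile' gets 0 because it shares a whole word
with the query, while 'Stereoanlagen' gets something larger.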

PHP General Mailing List (http://www.php.net/)