Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Stefano Mazzocchi
Ugo Cei wrote:
> On 15 Jun 05, at 18:27, Stefano Mazzocchi wrote:
> 
>> I've been working on this for the past few months. There is no clear-cut
>> solution, but using LSI is probably the best approach for the above
> 
> 
> LSI == ?

latent semantic indexing

>> As for string distance, you might want to check out  secondstring.sf.net.
> 
> There are lots of algorithms there 
>  package-summary.html>. Which one would you suggest before I start 
> trying them all one by one?

It really depends on what you need to do. A simple first-order
Levenshtein distance can take you a long way if you don't know the
distribution of the character frequencies in advance. If you do know it,
a measure tuned to that distribution gives better results, but at the
price of being less effective if the statistical properties of your data
streams start changing over time.
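
For reference, a plain character-level Levenshtein distance can be sketched in a few lines of Java. This is only a minimal illustration, not the SecondString or Commons Lang implementation; the class name `LevenshteinSketch` is made up for this example, and the sample strings are the ones used elsewhere in this thread.

```java
/** Hypothetical helper, written for illustration only. */
final class LevenshteinSketch {

    /** Number of single-character insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // from the empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // to the empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("test", "tent"));           // 1
        System.out.println(distance("test", "barf"));           // 4
        System.out.println(distance("test case", "tent base")); // 2
    }
}
```

Every edit costs 1 here, which is what makes it usable without any knowledge of the character distribution.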

-- 
Stefano.



Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Ugo Cei

On 15 Jun 05, at 18:27, Stefano Mazzocchi wrote:

> I've been working on this for the past few months. There is no clear-cut
> solution, but using LSI is probably the best approach for the above

LSI == ?

> As for string distance, you might want to check out secondstring.sf.net.

There are lots of algorithms there. Which one would you suggest before I
start trying them all one by one?


Ugo

--
Ugo Cei
Tech Blog: http://agylen.com/
Open Source Zone: http://oszone.org/
Wine & Food Blog: http://www.divinocibo.it/




Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Stefano Mazzocchi
Peter Hunsberger wrote:
> On 6/15/05, Ugo Cei <[EMAIL PROTECTED]> wrote:
> 
>>On 15 Jun 05, at 16:32, Tony Collen wrote:
> 
>>Actually, what I am trying to come up with is an algorithm for determining
>>whether two texts are (more or less) about similar subjects.
>>
> 
> Eee, then you may have to jump into the NLP stuff and Ontology
> processing in general; non-trivial stuff...

I've been working on this for the past few months. There is no clear-cut
solution, but using LSI is probably the best approach for the above
(even though there is a ton of literature on variations of the approach).
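
LSI proper needs a term-document matrix and a singular value decomposition, which is too much for a short example. The sketch below shows only the plain vector-space baseline that LSI builds on: cosine similarity between word-frequency vectors of two texts. It is an assumption-laden illustration, not LSI itself; the class name `TermCosine`, its method names and the simple tokenizer are all made up here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, written for illustration only. */
final class TermCosine {

    /** Counts how often each lower-cased word occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                tf.merge(w, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between the two term-frequency vectors, from 0.0 to 1.0. */
    static double cosine(String a, String b) {
        Map<String, Integer> ta = termFrequencies(a);
        Map<String, Integer> tb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            Integer other = tb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(ta);
        double normB = norm(tb);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a text without words has no direction to compare
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> tf) {
        double sum = 0.0;
        for (int count : tf.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

This only matches words that are literally identical; relating different words that tend to occur in the same contexts is the part LSI adds on top.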

As for string distance, you might want to check out secondstring.sf.net.

-- 
Stefano.



Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Peter Hunsberger
On 6/15/05, Ugo Cei <[EMAIL PROTECTED]> wrote:
> On 15 Jun 05, at 16:32, Tony Collen wrote:
> 
> Actually, what I am trying to come up with is an algorithm for determining
> whether two texts are (more or less) about similar subjects.
> 
Eee, then you may have to jump into the NLP stuff and Ontology
processing in general; non-trivial stuff...

-- 
Peter Hunsberger


Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Ugo Cei

On 15 Jun 05, at 16:32, Tony Collen wrote:

> Ugo,
>
> I think what you're looking for is the Levenshtein Distance Algorithm.
>
> http://www.google.com/search?hl=en&q=java+Levenshtein+implementation&btnG=Google+Search

Nice! I also found an implementation nearby:

http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/StringUtils.html#getLevenshteinDistance(java.lang.String,%20java.lang.String)


;)

However, this algorithm is useful for finding single-character  
differences, whereas I am more interested in word differences. IOW, the  
LD between "test" and "tent" is 1 and the LD between "test" and "barf"  
is 4, but for my purpose it should be 1 in both cases. And the LD  
between "test case" and "tent base" is smaller than the one between  
"test case" and "case under test", but I need it to be the reverse.

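One simple measure that behaves the way described above is word-set overlap (Jaccard similarity). The sketch below is only an illustration under assumptions: the class name `WordOverlap` and the naive tokenizer are made up for this example, and nobody in the thread proposed this exact measure.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, written for illustration only. */
final class WordOverlap {

    /** Splits on runs of non-letter, non-digit characters and lower-cases each word. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** Jaccard similarity: shared words divided by all distinct words, from 0.0 to 1.0. */
    static double jaccard(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 1.0; // two texts without words are treated as identical
        }
        Set<String> intersection = new HashSet<>(wa);
        intersection.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("test case", "tent base"));       // 0.0, no shared words
        System.out.println(jaccard("test case", "case under test")); // about 0.67, two of three words shared
    }
}
```

Taking 1 minus this value as a distance, "test" is equally far (distance 1) from "tent" and from "barf", and "test case" is closer to "case under test" than to "tent base". Word order and word frequency are ignored entirely.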

Actually, what I am trying to come up with is an algorithm for determining
whether two texts are (more or less) about similar subjects.


Ugo

--
Ugo Cei
Tech Blog: http://agylen.com/
Open Source Zone: http://oszone.org/
Wine & Food Blog: http://www.divinocibo.it/



Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Torsten Curdt
>>Excuse the Off-Topic, but I'm looking for a Java API for determining
>>the degree of similarity (based on word frequency or whatever) between
>>two text strings.

Commons Codec also has some algorithms
...it depends on what exactly you are after

http://jakarta.apache.org/commons/codec/

cheers
--
Torsten




Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Peter Hunsberger
On 6/15/05, Ugo Cei <[EMAIL PROTECTED]> wrote:
> Excuse the Off-Topic, but I'm looking for a Java API for determining
> the degree of similarity (based on word frequency or whatever) between
> two text strings.
> 
> I thought of posting here since I know there are some people here who are
> experts in semantic web technologies and could maybe help me.
> 
> Thanks in advance and sorry again for the OT.
> 

Probably not exactly what you want, but you can find a collection of
Natural Language Processing (NLP) tools at:

 http://opennlp.sourceforge.net/projects.html

If you've got to build your own NLP algorithms this might help...

-- 
Peter Hunsberger (not an expert in semantic web technologies...)


Re: [OT] Determining the similarity between a pair of texts

2005-06-15 Thread Tony Collen

Ugo,

I think what you're looking for is the Levenshtein Distance Algorithm.

http://www.google.com/search?hl=en&q=java+Levenshtein+implementation&btnG=Google+Search

HTH,

Tony

Ugo Cei wrote:
> Excuse the Off-Topic, but I'm looking for a Java API for determining the
> degree of similarity (based on word frequency or whatever) between two
> text strings.
>
> I thought of posting here since I know there are some people here who are
> experts in semantic web technologies and could maybe help me.
>
> Thanks in advance and sorry again for the OT.
>
> Ugo