Jason Kohles schrieb:
On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:
Hello @all,
I'm looking for a solution to find out if there is already some data
in my dataset that is similar to a new entry.
Example:
Companynames
I would like to find out if there are already companies in my
addressbook (DB) which are similar to a given name to avoid double
entries.
How to measure similarity:
I'am thinking of the hammingdistance. That means the difference
between Linux and Linus is 1 as there is one letter different. The
distance between Linux and Lisa is 3 as there is one letter more and
two are different.
Does anyone have an idea how to realize that?
Can one realize this with code running on the database (PL/SQL or
something) or is there a way doing that with DBIx::Class (drawback:
all data had to read before processing).
Many databases have either built-in or extensions available to do
this. The only ones that I have used are from the PostgreSQL contrib
module fuzzystrmatch, which provides functions for
calculating Levenshtein Distance and Soundex (possibly others, I can't
recall).
If your database doesn't support it directly, for some solutions
(Soundex, Metaphone) where you are calculating a checksum of the value
and then looking for ones where the checksum is the same, you can
precalculate the value when updating the database, and then just do a
simple comparison search. Something like this would probably work...
package MyDB::Company;
use base 'DBIx::Class';
use Text::Metaphone qw();
__PACKAGE__->add_columns( qw( company company_metaphone ) );
sub calculate_company_metaphone {
my ( $self, $value ) = @_;
my $co = lc( $value );
$co =~ s/[^\w\s]//g;
$co =~ s/\b(incorporated|inc|corporation|corp)\b/INC/;
$co =~ s/\b(company|co)\b/CO/;
# probably some other useful standardization stuff here
return Text::Metaphone::Metaphone( $co );
}
sub store_column {
my ( $self, $name, $value ) = @_;
if ( $name eq 'company' ) {
$self->company_metaphone( $self->calculate_comany_metaphone( $value ) );
}
$self->next::method( $name, $value );
}
Then you can simply search something like...
my $rs = $schema->resultset( 'Company' )->search( {
company_metaphone => MyDB::Company->calculate_company_metaphone(
$company_name ),
} );
Hello Jason,
your idea of an adapted store_column function is very clever.
But in the meantime I did some more googlesports and found out that the
soundex functions (Metaphone is one of them) was build for english
language and that there are some serious implications with the german
language. As I am living in germany ...
So I thought of useing the Levenshtein algorithm which cannot be
precalculated. So I'll do it in the database. I found out that there is
an addon for Levenshtein and it was pretty easy to install.
No I've just to create a stored procedure which takes the company names
and split's them at whitespaces and calculates the Levenshtein
difference to a given word and returns the lowest value per company
name. This way I'll get a good value, as the 'string' is an ajax user input.
Once again thank you for the code sample, I'll use it soon for an other
problem.
Greets,
Mario
_______________________________________________
List: http://lists.rawmode.org/cgi-bin/mailman/listinfo/dbix-class
Wiki: http://dbix-class.shadowcatsystems.co.uk/
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/trunk/DBIx-Class/
Searchable Archive: http://www.mail-archive.com/[email protected]/