I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume that every segment is a single character, and I always want the longest possible segment. As an example, the word "church" should be rendered as the list ('ch', 'u', 'r', 'ch'), without breaking the "ch" up any further, even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash; now I just need an algorithm to chop a string up into a list. Any ideas?
~wren
Why do you need to sort by alphabet after sorting by length? How about something like this:
#!/usr/bin/perl -w
use strict;

my $text = "church";
my @alphabet = qw(c h ch u r);

# make a regex with many alternations (compiled only once).
# reverse sort puts every segment ahead of its own prefixes,
# so "ch" is tried before "c"
my $letters = join "|" => reverse sort @alphabet;
my $regex = qr/($letters)/;

# then try to remove the first segment from the string
# using the regex (it matches the longest segments first)
# and add it to @segments until nothing matches anymore
my @segments;
push @segments, $1 while $text =~ s/^$regex//;

print join "-" => @segments;
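If the segments live as keys in a hash and you'd rather not destroy the original string, here is a rough sketch of the same idea. The sub name segment, the %segments hash and the returned remainder are my own assumptions, not anything from your module. It sorts the keys by length explicitly, escapes them with quotemeta in case a segment contains a regex metacharacter, and walks the string with \G instead of s///:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inventory; the real one would be the keys of
# the language's segment hash.
my %segments = map { $_ => 1 } qw(c h ch u r);

# Split $text into maximal segments. Returns a reference to the
# list of segments plus whatever tail of the string could not
# be matched (empty if everything was segmented).
sub segment {
    my ($text, $inventory) = @_;

    # Longest keys first, so "ch" is tried before "c".
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b }
        keys %$inventory;
    my $regex = qr/\G($alternation)/;

    # /g continues from the previous match position; /c keeps
    # that position when the match finally fails.
    my @found;
    pos($text) = 0;
    push @found, $1 while $text =~ /$regex/gc;

    my $rest = substr $text, pos($text) // 0;
    return (\@found, $rest);
}

my ($found, $rest) = segment("church", \%segments);
print join("-", @$found), "\n";    # prints ch-u-r-ch
```

A non-empty remainder tells you the word contains something that isn't in the inventory, which the s/// loop would otherwise drop silently.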
Hope this helps,
matt diephouse ------------------------- http://matt.diephouse.com
