Re: tricky parsing question

Matt Diephouse Thu, 12 Feb 2004 13:23:18 -0800

Wren Argetlahm wrote:

I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single-character strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?

~wren

Why do you need to sort by alphabet after sorting by length? How about something like this:

#!/usr/bin/perl -w
use strict;

my $text = "church";
my @alphabet = qw(c h ch u r);

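# make a regex with many alternations (compiled only once);
# a reverse sort puts each longer segment ahead of its own
# prefixes ("ch" before "c"), so the longest one is tried first
my $letters = join "|" => map { quotemeta($_) } reverse sort @alphabet;
my $regex = qr/($letters)/;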

# then repeatedly strip the first segment off the front of the
# string (the regex tries longer segments first) and add it to
# @segments until nothing matches anymore; note that this
# consumes $text, and any unmatched remainder is left in it
my @segments;
push @segments, $1 while $text =~ s/^$regex//;

print join("-" => @segments), "\n";   # prints ch-u-r-ch
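
Since the segments are stored as keys in a hash, the same idea could be wrapped up roughly as below. This is only a sketch under a few assumptions: the name segment() and the %english hash are made up for illustration, the alternation is sorted by length explicitly (a reverse sort would work as well), and it walks the string with \G and /gc instead of deleting from it, so it can complain when part of the input is not a known segment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split $text into
# the longest segments found among the keys of %$valid.
sub segment {
    my ($text, $valid) = @_;

    # Longest keys first, so "ch" is tried before "c"; quotemeta
    # protects any segment that contains regex metacharacters.
    my $alternation = join "|",
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b }
        keys %$valid;
    my $regex = qr/\G($alternation)/;

    # Walk the string with \G and /gc instead of deleting from it,
    # so the caller's string is left alone.
    my @segments;
    push @segments, $1 while $text =~ /$regex/gc;

    # pos() is where matching stopped; anything left over there
    # is not covered by the segment inventory.
    my $stopped = pos($text) // 0;
    die "no valid segment at offset $stopped in '$text'\n"
        if $stopped < length $text;

    return @segments;
}

# Assumed sample inventory, matching the @alphabet used above.
my %english = map { $_ => 1 } qw(c h ch u r);
print join("-", segment("church", \%english)), "\n";   # prints ch-u-r-ch
```

The die is a design choice, not a requirement: returning the segments found so far together with the leftover text would be just as reasonable if partial results are useful.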

Hope this helps,

matt diephouse
-------------------------
http://matt.diephouse.com
