Re: tricky parsing question

Chris Devers Fri, 23 Jan 2004 09:54:31 -0800

On Thu, 22 Jan 2004, wren argetlahm wrote:

> --- Rick Measham <[EMAIL PROTECTED]> wrote:
> > Wren, when you say 'segments' it appears you
> > mean phonemes or phonetics.
>
> Yeah, I do mean phonemes (or something like it). The
> module is language independent, but I'll check those
> modules out.

That's probably a good approach. Even if one of the existing linguistics
modules isn't quite right, if it's good OO code you should be able to
sub-class it to add the functionality you need.

It seems like there are a lot of subtleties to be considered; standing on
the shoulders of others who have worked on this may save a lot of pain.

> --- Chris Devers <[EMAIL PROTECTED]> wrote:
> > Your definition of "segment" here is vague; is
> > it safe to ignore that and just accept that a
> > canonical list of each language's 'segments' is
> > a static thing that is already stored as hash
> > keys?
>
> By "segment" I mean the smallest character or sequence
> of characters that has a regular pronunciation. But
> yes, it's safe to ignore that and assume there's a
> canonical list of "segments" already in memory.

Do you need to handle ambiguities? For example, "-ough" can famously be
pronounced several ways:

    bough -> 'bow'
    cough -> 'koff'
    dough -> 'doh'
    rough -> 'ruff'
    tough -> 'tuff'

And my copy of /usr/share/dict/words also has words I don't know:

    hough -> 'hock'  [seems to be synonymous with 'hock', meaning
                         a bone joint, used by butchers for pork]
    jough -> I can't find a definition or pronunciation
    lough -> 'lock'  [same as 'loch', as in 'Loch Ness Monster']
    sough -> 'soh'   [synonymous with 'field that is farmed' or 'groan']
    wough -> I can't find a definition or pronunciation

So that gives at least four or five ways to say '-ough', and maybe more.

Does your code need to handle such things? Or can we, again, assume that
this has been swept under the carpet of predetermined lists?
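
If ambiguities do turn out to matter, one possible layout is to let each
hash value be a list of candidate pronunciations rather than a single
value. This is only a sketch: the hash contents and the informal
spellings are assumptions lifted from the examples above, not anything
from the module being discussed.

```perl
use strict;
use warnings;

# Hypothetical layout: each segment maps to a list of candidate
# pronunciations (informal spellings, for illustration only)
my %en_hash = (
    ough => [ 'ow', 'off', 'oh', 'uff', 'ock' ],
    ch   => [ 'ch' ],
);

for my $seg ( sort keys %en_hash ) {
    my @ways = @{ $en_hash{$seg} };
    printf "%-4s has %d candidate pronunciation(s): %s\n",
        $seg, scalar @ways, join( ', ', @ways );
}
```

Picking the right candidate for a given word would still need a
per-word exception list or some other rule; this only shows where the
alternatives could live.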

> I am indeed associating the segments with values,
> hence storing them as keys in a hash.

Okay, but I still think that attacking this problem will be easier if you
start out with these elements in a normal, hand-ordered list, and then
pre-populate the keys of one or more hashes based on that. So, making this
up and not intending this to be a perfect or complete approach to things:

    my ( @en_segs, @fr_segs, @de_segs,    # [....] and so on per language
         %en_hash, %fr_hash, %de_hash );  # [....]

    # populate the arrays with predetermined lists
    # (qw[] treats '#' as just another word, so comments go outside it)
    @en_segs = (                  # English
        qw[ ough ious ion ],      #  four & three letter segments [....]
        qw[ ch sh th ],           #  two letter segments [....]
        qw[ x y z ],              #  one letter segments
    );
    @fr_segs = qw[ ];  # [....] French, repeat as above with English
    @de_segs = qw[ ];  # [....] German, and so on

    # convert those lists to hashes
    # probably a more idiomatic way to do this, but whatever:
    foreach my $key (@en_segs) {
        $en_hash{$key} = "";
    }

    # this might be the more idiomatic way? this is untested...
    $fr_hash{$_} = "" foreach @fr_segs;
    $de_hash{$_} = "" foreach @de_segs; # etc for other languages

At that point, you've got the data stored twice, and can begin working:

    my ( $string, $max_seg, $offset, $cur_str );
    $string = 'supercalafragalisticixpyaladocious';
    $max_seg = length( $en_segs[0] );
        # ^ because the array is hand sorted, you know that the
        #   first element will always have the longest segment
    $offset = 0;

    $cur_str = substr( $string, $offset, $max_seg);

(If it's not obvious, I'm making this up as I type -- improvements are
welcome.) From this point, you need to "walk the string", getting
substrings into $cur_str, then looking for the longest part of that
substring that exists in @en_segs, then incrementing $offset based on the
longest match you found, moving to the next $cur_str, and repeat until the
string is exhausted. Obviously, the last line or two up there ought to be
wrapped in a while loop or something to make this work.
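
The walk described above could be sketched roughly like this. The
segment list, the sample word, and the walk_string() name are all made
up for illustration, and passing unknown characters through as
one-character pieces is an assumption about what you'd want:

```perl
use strict;
use warnings;

# Hypothetical segment list, hand-ordered longest first
my @en_segs = qw( ough ious ion ch sh th a e i o u );
my %en_hash = map { $_ => 1 } @en_segs;

# Split $string into segments, always preferring the longest match.
# Characters not in the hash are passed through as one-character pieces.
sub walk_string {
    my ( $string, $segs, $max_seg ) = @_;
    my @found;
    my $offset = 0;
    while ( $offset < length $string ) {
        # never look past the end of the string
        my $remaining = length($string) - $offset;
        my $len = $max_seg < $remaining ? $max_seg : $remaining;

        # shrink the window until it matches a known segment
        while ( $len > 1
            && !exists $segs->{ substr( $string, $offset, $len ) } )
        {
            $len--;
        }
        push @found, substr( $string, $offset, $len );
        $offset += $len;
    }
    return @found;
}

my @parts = walk_string( 'toughish', \%en_hash, length $en_segs[0] );
print join( '-', @parts ), "\n";    # prints "t-ough-i-sh"
```

Note that this version only consults the hash; the ordered array is
used just to find the length of the longest segment.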

I have a feeling that grep may help with the array lookups, but I can't
think of how to phrase the line[s] that would do it.
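
For what it's worth, one way the grep line might be phrased is below.
It's a sketch with a made-up sample list, and it leans on the array
being ordered longest first:

```perl
use strict;
use warnings;

my @en_segs = qw( ough ious ion ch sh th c h s t );  # hypothetical sample list
my $cur_str = 'thro';                                # current window

# keep every segment that $cur_str starts with
my @candidates = grep { index( $cur_str, $_ ) == 0 } @en_segs;

# @en_segs is ordered longest first, so the first candidate is the longest
my $longest = $candidates[0];
print "$longest\n" if defined $longest;    # prints "th"
```

grep scans the whole list for every window, so a hash lookup is likely
cheaper once the lists get long.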

I have a feeling that decomposing the string into an array of characters
might help (maybe with grep, etc), but then I have a feeling that doing
that would be treating this too much like a C program, and Perl shouldn't
have to stoop to parsing strings the way C does.
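
A more Perl-ish alternative, offered as an assumption rather than
anything tested against real data, would be to build one regex out of
the segment list and let the regex engine do the walking. Perl tries
alternatives left to right, so the list has to be sorted longest first.
The sample list and word are again made up:

```perl
use strict;
use warnings;

my @en_segs = qw( ch sh th ion ough ious );   # hypothetical, any order

# longest alternatives first, with regex metacharacters escaped
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @en_segs;

# match a known segment, or fall back to any single character
my $string = 'toughish';
my @parts  = $string =~ /($alt|.)/gs;
print join( '-', @parts ), "\n";    # prints "t-ough-i-sh"
```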

I still have a feeling that Parse::RecDescent would make all of this a lot
easier, but I'm not the one to walk you through using that module. This
really does seem like the sort of problem that RecDescent is best at
though, so it's worth looking up some of Damian Conway's documentation for
the module. If you can get your head around it, it's probably *way* more
effective than most any other approach anyone could suggest.

-- 
Chris Devers
