Re: [CODE4LIB] lingua::stem::snowball [resolved]
On Oct 12, 2009, at 10:27 PM, Benjamin Florin wrote:

> foreach my $word (keys %words)
> {
>     $words_stems{$stemmer->stem($word)} += $words{$word};
> }
>
> foreach my $idea (@ideas)
> {
>     my $idea_stem = $stemmer->stem( $idea );
>     print "$idea ($idea_stem)\n";
>     print $words_stems{$idea_stem}."\n";
> }

A number of other people in other venues have suggested similar things -- pre-process the hash of stems. I will go with this. Resolved. Thank you. internet++

--
Eric Morgan
Re: [CODE4LIB] lingua::stem::snowball
Presumably the call to stem() is the expensive part of your loop, so if that is true I'd want to cut it out. It looks to me like you can pass an array reference to stem(), so there's no need to call stem() in a loop at all. I'd think something like the code below should reduce your calls to stem() to one call for the idea and one call for the list of words.

Note that I used a sorted set of keys to ensure that the counts and the stemmed words stay in the same order when adding up the totals. The sort could be expensive too, so this may not work out better for you, depending on your input data and the performance of sort() and stem(). You could also use stem_in_place() if you don't want to make a copy of the array.

Changing to use an array of @ideas instead of the scalar $idea would use an analogous technique.

Matt

    use strict;
    use Lingua::Stem::Snowball;

    my $idea  = 'books';
    my %words = (
        'books'         => 5,
        'library'       => 6,
        'librarianship' => 5,
        'librarians'    => 3,
        'librarian'     => 3,
        'book'          => 3,
        'museums'       => 2
    );
    my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
    my $idea_stem = $stemmer->stem( $idea );
    print "$idea ($idea_stem)\n";

    # Stem every word in a single call; @stemwords lines up with @wordkeys
    my @wordkeys  = sort(keys(%words));
    my @stemwords = $stemmer->stem( \@wordkeys );

    my $i     = 0;
    my $total = 0;
    foreach my $word (@wordkeys) {
        if ( $idea_stem eq $stemwords[$i] ) { $total += $words{ $word } }
        $i++;
    }
    print "$total\n";
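For what it's worth, here is a rough sketch of the "analogous technique" for a whole list of ideas. It is not from the thread: the helper name totals_for_ideas and its $stem_list callback are made up for illustration. The callback is assumed to take an array reference and return the list of stems in the same order, which is how the thread describes stem() behaving when it is handed an array reference.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): total the counts in %$words
# for every idea in @$ideas, with one batch stemming call per list.
# $stem_list is a code ref: array ref in, list of stems out, same order.
# With a Lingua::Stem::Snowball object that would presumably be
#   sub { $stemmer->stem( $_[0] ) }
sub totals_for_ideas {
    my ( $stem_list, $ideas, $words ) = @_;

    # Copy the keys once so the order is fixed for the rest of the sub.
    my @wordkeys  = keys %$words;
    my @wordstems = $stem_list->( \@wordkeys );

    # Sum the counts per stem; index $i ties each stem back to its word.
    my %stem_totals;
    for my $i ( 0 .. $#wordkeys ) {
        $stem_totals{ $wordstems[$i] } += $words->{ $wordkeys[$i] };
    }

    # Stem the ideas in one call, then look each stem up in the totals.
    my @ideastems = $stem_list->($ideas);
    my %totals;
    for my $i ( 0 .. $#$ideas ) {
        $totals{ $ideas->[$i] } = $stem_totals{ $ideastems[$i] } || 0;
    }
    return \%totals;
}
```

Passing the stemmer in as a callback is only a convenience so the counting logic can be tried with a stand-in stemmer; calling the stemmer object directly inside the sub would work just as well.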
Re: [CODE4LIB] lingua::stem::snowball
It's been a while since I perled, so this might not be the most idiomatic solution, but you could stem the entire words hash once and create a hash of all the sums (%words_stems), then run the list of idea words (@ideas), checking only the desired stems:

    use strict;
    use Lingua::Stem::Snowball;

    my @ideas = ('books', 'otters', 'library');
    my %words = (
        'books'         => 5,
        'library'       => 6,
        'librarianship' => 5,
        'librarians'    => 3,
        'librarian'     => 3,
        'book'          => 3,
        'museums'       => 2
    );
    my %words_stems;    # empty hash; "= {}" would store a stray hash ref as a key
    my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

    # Stem each word once and add its count to the total for that stem
    foreach my $word (keys %words)
    {
        $words_stems{$stemmer->stem($word)} += $words{$word};
    }

    # Each idea then costs one stem() call and one hash lookup
    foreach my $idea (@ideas)
    {
        my $idea_stem = $stemmer->stem( $idea );
        print "$idea ($idea_stem)\n";
        print $words_stems{$idea_stem}."\n";
    }

The first foreach loop is executed once per word in %words, while the second foreach loop gets run once per item in @ideas. So 150,000 words with 1,000 ideas would call the stem function (which is presumably where all the cost is) only about 151,000 times: once per word plus once per idea.

If you plan on doing something similar later, you could save that hash to disk, btw.

Ben

--
Benjamin Florin
Technology Assistant for Blended Education
Simmons College GSLIS
617-521-2842
benjamin.flo...@simmons.edu
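On saving the hash to disk: one way that might work is the core Storable module. The sketch below is only an illustration, not something from the thread; the sub names save_stem_counts and load_stem_counts and the file name are made up, and the counts are placeholder values.

```perl
use strict;
use warnings;
use Storable qw( nstore retrieve );

# Hypothetical helper: write a hash of stem => count to $file.
# nstore() uses network byte order, so the file can move between machines.
sub save_stem_counts {
    my ( $file, $counts ) = @_;
    nstore( $counts, $file ) or die "Could not write $file\n";
    return;
}

# Hypothetical helper: read the hash back as a hash reference,
# or return undef if the cache file does not exist yet.
sub load_stem_counts {
    my ($file) = @_;
    return unless -e $file;
    return retrieve($file);
}

# Illustrative use; the file name and counts are placeholders.
my $cache_file = 'words_stems.sto';
save_stem_counts( $cache_file, { book => 8, museum => 2 } );

my $words_stems = load_stem_counts($cache_file);
print "$_: $words_stems->{$_}\n" for sort keys %$words_stems;
```

A later run could call load_stem_counts() first and only rebuild the hash with the stemmer when it returns undef.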
[CODE4LIB] lingua::stem::snowball
Can someone help me use Lingua::Stem::Snowball more efficiently?

I want to count the total number of times a word stem appears in a hash. Here is a short example:

    use strict;
    use Lingua::Stem::Snowball;

    my $idea  = 'books';
    my %words = (
        'books'         => 5,
        'library'       => 6,
        'librarianship' => 5,
        'librarians'    => 3,
        'librarian'     => 3,
        'book'          => 3,
        'museums'       => 2
    );
    my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
    my $idea_stem = $stemmer->stem( $idea );
    print "$idea ($idea_stem)\n";

    my $total = 0;
    foreach my $word ( keys %words ) {
        my $word_stem = $stemmer->stem( $word );
        print "\t$word ($word_stem)\n";
        if ( $idea_stem eq $word_stem ) { $total += $words{ $word } }
    }
    print "$total\n";

In the end, the value of $total equals 8. That is, more or less, what I expect, but how can I make the foreach loop more efficient?

In reality, my application fills %words with as many as 150,000 keys. Moreover, $idea is really just a single element in an array of about 100 words. Doing the math, the if statement in my foreach loop will get executed as many as 15,000,000 times. To make matters even worse, I plan to run the whole program about 10,000 times. That is a whole lot of processing just to count words!

Is there some way I could short-circuit the foreach loop? I saw Lingua::Stem::Snowball's stem_in_place method, but to use it I must pass it an array, disassociating my keys from their values.

Second, is there a way I can make the stemming more aggressive? For example, I was hoping the stems of library, librarianship, and librarian would all be equal, but alas, they aren't.

Any suggestions?

--
Eric Lease Morgan