Re: [CODE4LIB] lingua::stem::snowball [resolved]

2009-10-13 Thread Eric Lease Morgan
On Oct 12, 2009, at 10:27 PM, Benjamin Florin wrote:

> foreach my $word (keys %words)
> {
>   $words_stems{$stemmer->stem($word)} += $words{$word};
> }
>
> foreach my $idea (@ideas)
> {
>   my $idea_stem = $stemmer->stem( $idea );
>   print "$idea ($idea_stem)\n";
>   print $words_stems{$idea_stem}."\n";
> }


A number of other people in other venues have suggested similar things  
-- pre-process the hash of stems. I will go with this. Resolved. Thank  
you.

  internet++

-- 
Eric Morgan


Re: [CODE4LIB] lingua::stem::snowball

2009-10-12 Thread Matt Jones
Presumably the call to stem() is the expensive part of your loop, so I'd
want to cut that out if that is true. It looks to me like you can pass an
array reference to stem(), so there's no need to call stem() in a loop
at all. I'd think something like the code below should reduce your
calls to stem() to one call for the idea and one call for the list of
words. Note that I used a sorted set of keys to ensure that the counts
and the stemmed words stay in the same order when adding up the
totals. The sort could be expensive too, so this may not work out better
for you, depending on your input data and the performance of sort() and
stem(). You could also use stem_in_place() if you don't want to make a copy
of the array. Changing to use an array of @ideas instead of the scalar
$idea would use an analogous technique.

Matt

use strict;
use Lingua::Stem::Snowball;
my $idea  = 'books';
my %words = ( 'books'         => 5,
              'library'       => 6,
              'librarianship' => 5,
              'librarians'    => 3,
              'librarian'     => 3,
              'book'          => 3,
              'museums'       => 2
            );
my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
my $idea_stem = $stemmer->stem( $idea );
print "$idea ($idea_stem)\n";
my @wordkeys = sort(keys(%words));
my @stemwords = $stemmer->stem( \@wordkeys );
my $i = 0;
my $total = 0;
foreach my $word (@wordkeys) {
  if ( $idea_stem eq $stemwords[$i] ) { $total += $words{ $word } }
  $i++;
}
print "$total\n";
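
The stem_in_place() and @ideas variations mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: the @ideas list is hypothetical, the input is assumed to be lowercase already (as far as I know, stem_in_place() does not lowercase its input the way stem() does), and the counts are summed per stem so that no sort is needed.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my @ideas = ( 'books', 'library' );    # hypothetical list of ideas
my %words = ( 'books'         => 5,
              'library'       => 6,
              'librarianship' => 5,
              'librarians'    => 3,
              'librarian'     => 3,
              'book'          => 3,
              'museums'       => 2
            );
my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Copy the keys, then stem the copy in place with a single call.
# $stems[$i] is the stem of $wordkeys[$i] because both arrays
# keep the same order.
my @wordkeys = keys %words;
my @stems    = @wordkeys;
$stemmer->stem_in_place( \@stems );

# Add up the counts once per stem.
my %stem_totals;
for my $i ( 0 .. $#wordkeys ) {
  $stem_totals{ $stems[$i] } += $words{ $wordkeys[$i] };
}

# Stem all of the ideas with one more call, then look each one up.
my @idea_stems = @ideas;
$stemmer->stem_in_place( \@idea_stems );
for my $i ( 0 .. $#ideas ) {
  my $total = $stem_totals{ $idea_stems[$i] } || 0;
  print "$ideas[$i] ($idea_stems[$i]): $total\n";
}
```

Whether this beats calling stem() inside the loop would need measuring against real data.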


Re: [CODE4LIB] lingua::stem::snowball

2009-10-12 Thread Benjamin Florin
It's been a while since I perled, so this might not be the most
idiomatic solution, but you could stem the entire words hash once
and create a hash of all the sums (%words_stems), then run the list of
idea words (@ideas), checking only the desired stems:

use strict;
use Lingua::Stem::Snowball;
my @ideas  = ('books', 'otters', 'library');
my %words = ( 'books'         => 5,
              'library'       => 6,
              'librarianship' => 5,
              'librarians'    => 3,
              'librarian'     => 3,
              'book'          => 3,
              'museums'       => 2
            );
my %words_stems;
my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );

foreach my $word (keys %words)
{
  $words_stems{$stemmer->stem($word)} += $words{$word};
}

foreach my $idea (@ideas)
{
  my $idea_stem = $stemmer->stem( $idea );
  print "$idea ($idea_stem)\n";
  print $words_stems{$idea_stem}."\n";
}

The first foreach loop is executed once per word in %words, while the
second foreach loop gets run once per item in @ideas. So 150,000 words
with 1,000 ideas would call the stem function (which is presumably
where all the cost is) about 151,000 times: once per word plus once
per idea, rather than once per word for every idea.

If you plan on doing something similar later, you could save that hash
to disk, btw.
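
One way to do that is with the core Storable module. This is a minimal sketch, not the only option: the file name is made up, and the hash contents are stand-in values rather than real output from the stemmer.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Stand-in for the %words_stems hash built by the first loop
# (illustrative values only).
my %words_stems = ( 'book' => 8, 'museum' => 2 );

# Hypothetical cache file name.
my $cache = 'words_stems.storable';

# Write the hash to disk; store() takes a reference.
store( \%words_stems, $cache );

# On a later run, load it back instead of re-stemming everything.
# retrieve() returns a hash reference.
my $restored = retrieve( $cache );
foreach my $stem ( sort keys %$restored ) {
  print "$stem: $restored->{$stem}\n";
}
```

The saved file is only valid as long as %words has not changed, so it would need rebuilding whenever the word counts do.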

Ben

-- 
Benjamin Florin
Technology Assistant for Blended Education
Simmons College GSLIS
617-521-2842
benjamin.flo...@simmons.edu


[CODE4LIB] lingua::stem::snowball

2009-10-12 Thread Eric Lease Morgan
Can someone help me use Lingua::Stem::Snowball more efficiently?

I want to count the total number of times a word stem appears in a  
hash. Here is a short example:


use strict;
use Lingua::Stem::Snowball;
my $idea  = 'books';
my %words = ( 'books'         => 5,
              'library'       => 6,
              'librarianship' => 5,
              'librarians'    => 3,
              'librarian'     => 3,
              'book'          => 3,
              'museums'       => 2
            );
my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
my $idea_stem = $stemmer->stem( $idea );
print "$idea ($idea_stem)\n";
my $total = 0;
foreach my $word ( keys %words ) {
  my $word_stem = $stemmer->stem( $word );
  print "\t$word ($word_stem)\n";
  if ( $idea_stem eq $word_stem ) { $total += $words{ $word } }
}
print "$total\n";


In the end, the value of $total equals 8. That is, more or less, what
I expect, but how can I make the foreach loop more efficient? In
reality, my application fills %words with as many as 150,000 keys.
Moreover, $idea is really just a single element in an array of about
100 words. Doing the math, the if statement in my foreach loop will
get executed as many as 15,000,000 times (150,000 x 100). To make
matters even worse, I plan to run the whole program about 10,000
times. That is a whole lot of processing just to count words!

Is there some way I could short-circuit the foreach loop? I saw
Lingua::Stem::Snowball's stem_in_place method, but to use it I must
pass it an array, disassociating my keys from their values.

Second, is there a way I can make the stemming more aggressive? For  
example, I was hoping the stem of library would equal the stems of  
library, librarianship, and librarian, but alas, they don't.

Any suggestions?

-- 
Eric Lease Morgan