Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-11 Thread Petr Gladkikh
On Mon, Aug 8, 2011 at 1:46 PM, Tuba Lambanog tuba.lamba...@gmail.com wrote:
 Hello,

 I’m doing a word stemmer for a non-English language. A stemmer parses
 a word into its word parts: prefixes, roots, suffixes. The input word
 is at least a root word (an English example would be ‘cloud’), but can be
 any combination of prefix(es) and a root (e.g., ‘pre-nuptial’), or a
 root and suffix(es) (‘cloudy’), or all three (‘unidirection’). A
 sequence of more than one prefix in a word is considered one
 occurrence of a prefix, and similarly for complex suffixes; thus,
 ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
 prefixes, roots, and suffixes are each in their own set data structure.

 The approach I am pursuing is to create a set of potential suffixes
 that the input word contains. Assume, for simplicity, that the suffix
 set consists of #{-or, -er, -al, -ion, -ional, -able}. The input
 ‘directional’ would have the candidate suffix set #{-al, -ional}. Now,
 drop the longest suffix (‘-ional’) from the input, then check whether
 the remaining string (‘direct’) is a root; if it is, done. If not,
 try the next suffix (‘-al’) in the candidate suffix set. Prefixes
 will be similarly processed. Input words with both prefixes and
 suffixes will be fun to do ;)

 I’m having a hard time thinking through the process of generating the
 candidate suffix set using set forms, and I’m beginning to think I
 have selected an arduous path (for me).

 Thoughts?


This may be somewhat off-topic, but have you looked at Snowball
(http://snowball.tartarus.org/)?
The algorithm is different, but the language used there to describe
stemmers is almost Lisp and may be useful at least as a reference.

-- 
Petr Gladkikh

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-11 Thread Tuba Lambanog
Hi, Petr,
Thank you for the pointer to the site. Indeed a treasure trove of ideas on
stemmer algorithms.
Tuba


Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Tuba Lambanog
Hello,

I’m doing a word stemmer for a non-English language. A stemmer parses
a word into its word parts: prefixes, roots, suffixes. The input word
is at least a root word (an English example would be ‘cloud’), but can be
any combination of prefix(es) and a root (e.g., ‘pre-nuptial’), or a
root and suffix(es) (‘cloudy’), or all three (‘unidirection’). A
sequence of more than one prefix in a word is considered one
occurrence of a prefix, and similarly for complex suffixes; thus,
‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
prefixes, roots, and suffixes are each in their own set data structure.

The approach I am pursuing is to create a set of potential suffixes
that the input word contains. Assume, for simplicity, that the suffix
set consists of #{-or, -er, -al, -ion, -ional, -able}. The input
‘directional’ would have the candidate suffix set #{-al, -ional}. Now,
drop the longest suffix (‘-ional’) from the input, then check whether
the remaining string (‘direct’) is a root; if it is, done. If not,
try the next suffix (‘-al’) in the candidate suffix set. Prefixes
will be similarly processed. Input words with both prefixes and
suffixes will be fun to do ;)
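
The suffix-stripping loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `roots` and `suffixes` sets hold made-up sample data, the function names `ends-with-candidates` and `strip-suffix` are hypothetical, and `clojure.string/ends-with?` requires Clojure 1.8 or later.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample data; a real stemmer would load its own sets.
(def roots #{"direct" "cloud"})
(def suffixes #{"or" "er" "al" "ion" "ional" "able"})

(defn ends-with-candidates
  "Suffixes from `suffix-set` that `word` ends with, longest first."
  [suffix-set word]
  (->> suffix-set
       (filter #(str/ends-with? word %))
       (sort-by count >)))

(defn strip-suffix
  "Tries each candidate suffix, longest first. Returns [root suffix] for
  the first one whose removal leaves a known root, or nil if none does."
  [root-set suffix-set word]
  (some (fn [suffix]
          (let [stem (subs word 0 (- (count word) (count suffix)))]
            (when (contains? root-set stem)
              [stem suffix])))
        (ends-with-candidates suffix-set word)))

(strip-suffix roots suffixes "directional")
;; => ["direct" "ional"]
```

A bare root such as "cloud" yields nil here, so a caller would check the root set first or treat the empty string as a suffix.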

I’m having a hard time thinking through the process of generating the
candidate suffix set using set forms, and I’m beginning to think I
have selected an arduous path (for me).

Thoughts?

Thanks.
Tuba



Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Andreas Kostler
Hi Tuba,
I don't quite understand what you mean by "I’m having a hard time
thinking through the process of generating the candidate suffix set
using set forms", but I have written a Porter stemmer for English in
the past.
I understand that's not what you're looking for, but it is more of a
framework for building stemmers.

You specify rules like:
{:c? condition :s1 "abc" :s2 "efg" :a action}
which reads: if the condition is met, replace s1 with s2 and execute the action.
Here s1 could be a suffix, etc. All you need to do is specify these rules.
Have a browse:
https://github.com/AndreasKostler/Stout
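
To make the rule shape concrete, here is a minimal sketch of how one such rule map might be applied. It is not Stout's actual API: the function `apply-rule`, the interpretation of `:c?` as a predicate on the stem, and the sample rule are all assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

(defn apply-rule
  "Applies one rule map to `word`. Assumed meaning of the keys:
  :c? is a predicate on the stem left after removing :s1,
  :s1 is the suffix to match, :s2 its replacement,
  :a is a function run on the rewritten word.
  Returns `word` unchanged when the rule does not apply."
  [{:keys [c? s1 s2 a]} word]
  (if (str/ends-with? word s1)
    (let [stem (subs word 0 (- (count word) (count s1)))]
      (if (c? stem)
        (a (str stem s2))
        word))
    word))

;; Hypothetical rule: rewrite a trailing "ational" to "ate"
;; whenever the remaining stem is non-empty.
(def sample-rule {:c? seq :s1 "ational" :s2 "ate" :a identity})

(apply-rule sample-rule "relational")
;; => "relate"
```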

Cheers
Andreas




Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Ken Wesson
On Mon, Aug 8, 2011 at 2:46 AM, Tuba Lambanog tuba.lamba...@gmail.com wrote:
 I’m having a hard time thinking through the process of generating the
 candidate suffix set using set forms, and I’m beginning to think I
 have selected an arduous path (for me).

 Thoughts?

Store the prefixes in a Patricia tree, and the reversed suffixes in
another Patricia tree. For suffixes, start at the end of the word and
walk backward while traversing the suffix tree until you hit a leaf or
run out of matching branches. Each node traversed that ends a stored
suffix (including the root, which is the empty string) is a potential
suffix, and you traverse them in short-to-long order, so reverse that
to get them in long-to-short order. The case for prefixes is analogous,
except you start at the start of the word and walk forward while
traversing the prefix tree. "No suffix" and "no prefix" needn't be
handled as special cases; they are just the empty string as suffix or
prefix, of length zero.
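
A rough sketch of the backward walk follows. For brevity it uses a plain character trie built from nested maps rather than a path-compressed Patricia tree, so it shows the traversal idea only. The names `add-chars`, `suffix-trie`, and `trie-candidates` and the `:end?` marker are all hypothetical.

```clojure
(defn add-chars
  "Adds the character sequence `cs` to a trie of nested maps,
  marking the node where it ends with :end? true."
  [trie cs]
  (assoc-in trie (concat cs [:end?]) true))

(defn suffix-trie
  "Builds a trie from the reversed form of each suffix."
  [suffixes]
  (reduce add-chars {} (map reverse suffixes)))

(defn trie-candidates
  "Walks `word` backward through `trie` and returns every stored suffix
  that ends the word, longest first."
  [trie word]
  (loop [node trie
         cs (reverse word)
         path []      ; characters consumed so far, in reversed order
         found ()]    ; a list, so conj puts the newest (longest) match first
    (let [found (if (:end? node)
                  (conj found (apply str (reverse path)))
                  found)
          next-node (get node (first cs))]
      (if (and (seq cs) next-node)
        (recur next-node (rest cs) (conj path (first cs)) found)
        found))))

;; The empty string is stored too, so "no suffix" needs no special case.
(def sample-trie (suffix-trie #{"" "or" "er" "al" "ion" "ional" "able"}))

(trie-candidates sample-trie "directional")
;; => ("ional" "al" "")
```

Prefixes would work the same way with the suffixes stored unreversed and the word walked forward.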

-- 
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.



Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Resty Cena
Hi, Andreas,

 I don't quite understand what you mean by "I’m having a hard time
 thinking through the process of generating the
 candidate suffix set using set forms"

It is my usual roundabout way of saying I don't know how to do this. ;)

I'm looking at your code as we speak.

Thanks,
Tuba


Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Tuba Lambanog
Hi,
Thank you for the tip. It does look like the Patricia tree -- or suffix tree
-- is made-to-order for this kind of task. I'm reading up on it. Would there
be a Clojure implementation of this technology, I wonder.
Tuba


Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Ken Wesson
On Mon, Aug 8, 2011 at 11:41 AM, Tuba Lambanog tuba.lamba...@gmail.com wrote:
 Hi,
 Thank you for the tip. It does look like the Patricia tree -- or suffix tree
 -- is made-to-order for this kind of task. I'm reading up on it.

You're welcome.

 Would there be a Clojure implementation of this technology, I wonder.

Even if not, it's probably trivial to slap one together, and test it,
in less than a day in Clojure.

As for generating your candidate seqs of prefixes and suffixes, just
cons onto an initial nil in your reduction* and you'll end up with a
seq that, traversed forwards, goes from longest candidate to shortest.
For suffixes you'll want to (map #(apply str (reverse %)) the-seq),
though, to get the suffixes the right way around (since they'll need
to be stored reversed in their tree).
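
The consing idea might look something like the sketch below. As a simplifying assumption, a plain set of reversed suffix strings stands in for the tree lookup; `reversed-candidates` and `reversed-suffixes` are hypothetical names.

```clojure
(defn reversed-candidates
  "Reads `word` backward and reduces over its ever-longer reversed tails,
  consing each one found in `reversed-suffix-set` onto an initial nil.
  Because cons prepends, the longest match ends up first."
  [reversed-suffix-set word]
  (let [rev (apply str (reverse word))]
    (reduce (fn [acc n]
              (let [candidate (subs rev 0 n)]
                (if (contains? reversed-suffix-set candidate)
                  (cons candidate acc)
                  acc)))
            nil
            (range 1 (inc (count rev))))))

;; Suffixes stored reversed, as they would be in the tree.
(def reversed-suffixes #{"ro" "re" "la" "noi" "lanoi" "elba"})

(reversed-candidates reversed-suffixes "directional")
;; => ("lanoi" "la")

;; Turn the candidates the right way around again.
(map #(apply str (reverse %))
     (reversed-candidates reversed-suffixes "directional"))
;; => ("ional" "al")
```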



Re: Elegant tools deserve elegant solutions. -- L. E. Gant

2011-08-08 Thread Resty Cena
Hi, Ken,

Thanks for the suggestion.

As I was looking at a suffix tree, it suddenly struck me that the following
strategy may do just as well:

1. Use rest and next to generate the tentative suffix set; thus, for
'directional', it will give the set #{irectional rectional ectional
ctional tional ional onal nal al l}.

2. Intersect this with the set of suffixes.

3. Select the longest item in the result set.
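
The three steps above can be sketched as follows. The function names `tails` and `longest-suffix` are made up for illustration, and the suffix set is the sample one from the original question.

```clojure
(require '[clojure.set :as set])

(defn tails
  "Step 1: every proper, non-empty tail of `word`, as a set of strings.
  `next` returns nil once the characters run out, which ends the walk."
  [word]
  (->> (iterate next (next word))
       (take-while some?)
       (map #(apply str %))
       set))

(defn longest-suffix
  "Steps 2 and 3: intersect the tails with `suffix-set` and return the
  longest match, or nil when nothing matches."
  [suffix-set word]
  (let [matches (set/intersection (tails word) suffix-set)]
    (when (seq matches)
      (apply max-key count matches))))

(longest-suffix #{"or" "er" "al" "ion" "ional" "able"} "directional")
;; => "ional"
```

Since each tail has a different length, the longest match is always unique. Keeping the whole intersection, sorted by length, would give the fallback candidates if the longest one does not leave a root.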

I'm trying this one out now.

Tuba

