You can try indexing all word 2-grams, 3-grams, and 4-grams in your corpus as terms in their own right. Then you can enumerate the terms in your index, the same way you already do for single words, and see which n-grams occur the most.
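
To make the idea concrete, here is a rough sketch of the counting step in plain Python rather than Lucene code. It assumes the documents are available as plain strings; the tokenizer regex and the names `word_ngrams`, `ngram_document_frequency`, and `docs` are made up for illustration. It counts how many documents each n-gram appears in, which is roughly what you would read off the index once the n-grams are stored as terms.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return all runs of n consecutive words in text, lowercased."""
    # Assumption: a crude tokenizer; a real analyzer would do more.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_document_frequency(docs, sizes=(2, 3, 4)):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        seen = set()  # each n-gram counts at most once per document
        for n in sizes:
            seen.update(word_ngrams(doc, n))
        df.update(seen)
    return df


# Hypothetical sample corpus.
docs = [
    "Flooding in New Orleans continued on Monday.",
    "Residents of New Orleans began to return.",
    "The new schedule was posted in Orleans county.",
]

df = ngram_document_frequency(docs)
for ngram, count in df.most_common(3):
    print(count, ngram)
```

With these three sample documents, "new orleans" is the only n-gram found in more than one document, so it ranks first. On a real corpus you would probably want to drop n-grams made up mostly of stop words before ranking, since sequences like "of the" will otherwise dominate.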

On 9/12/05, Wilkerson, Cory <[EMAIL PROTECTED]> wrote:
>
> So...I've had good/great luck finding all terms in my index using the
> Lucene API - life is good. Now - I'm trying to take things a step
> further and find sequences of key words (maybe two/three/four word
> combinations). It's great that I can find "new" and "orleans", but I'm
> mostly interested in articles that contain "new orleans". I realize I
> can *search* for these terms but I'm more interested in writing an
> engine that says "Hey, these sequences seem to be fairly important
> because they're occurring quite a bit across this index."
>
> Any suggestions?
> Cory Wilkerson
>

--
Andy Liu
[EMAIL PROTECTED]
(301) 873-8458
