Newbie: Human Readable Stemming, Lucene Architecture, etc!

Owen Densmore Thu, 20 Jan 2005 13:51:08 -0800

Hi .. I'm new to the list so forgive a dumb question or two as I get started.

We're in the midst of converting a small collection (1200-1500 currently) of scientific literature to be easily searchable/navigable. We'll likely provide both a text query interface as well as a graphical way to search and discover.

Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). (Bear in mind I couldn't even pronounce most of these concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture. I should at the very least be able to use the analyzer and stemmer to create a good starting point in the project. I'd also like to leave a nice architecture behind in case we or others end up experimenting with, or extending, the system.

So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: Is there a way to have the stemmer produce english base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such as DTM/LSI and graphical clustering and navigation. Naturally we'll provide code for these parts that are not in Lucene. But the question arises: is this kinda dumb?! Has anyone stretched Lucene's design center with positive results? Are we barking up the wrong tree?

3 - A nit on hyphenation: Our collection is scientific so has many hyphenated words. I'm wondering about your experiences with hyphenation. In our collection, things like self-organization, power-law, space-time, small-world, agent-based, etc. occur often, for example. So the question is: Do folks break up hyphenated words? If not, do you stem the parts and glue them back together? Do you apply stoplists to the parts?

Thanks for any help and pointers you can fling along,

Owen    http://backspaces.net/    http://redfish.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Newbie: Human Readable Stemming, Lucene Architecture, etc!

Reply via email to