Yes, unique terms. I've started looking at the StandardAnalyzer and
related classes, and I'll see if I can use them for what I want.
Also, I'd like to massage the text a bit more than just reducing it to the
unique terms. For example, common words should be removed (some of which
are found in the StandardAnalyzer).
In addition, I'd like the words themselves to be shortened. For example,
the word 'application' should be changed to 'applic', the word 'deploy' to
'deploi', 'deploying' to 'deploy', etc.
Is there an analyzer that does this? If not, how difficult would it be to
build an analyzer to do so?
Basically, I need to drastically cut down on the number of characters in
these documents before putting them into Oracle. The changes that I
mention above ('deploy' -> 'deploi', etc) all occur when using a C++
library from Apple that I believe was written by Doug Cutting when he was
in their Advanced Technology Group.
In fact, looking at some source code from that project, I just found three
interesting files that I can probably use: EnglishAbbreviations,
EnglishStopwords, and EnglishSubstitutions.
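
To make the goal concrete, here is a minimal sketch of the reduction
described above: lowercase the text, drop common words, optionally rewrite
each term, and keep only the first occurrence of each result. It is plain
Java and does not use any Lucene classes. The class name TermReducer, the
short stop list, and the stemmer hook are all assumptions made for
illustration; they are not taken from Lucene or from the Apple library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene.
public class TermReducer {

    // Illustrative stop list only; a real list would be longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    // Returns the unique, lowercased, non-stop-word terms of the text,
    // space-separated, in the order they were first seen. Each surviving
    // term is passed through the supplied stemmer before deduplication.
    public static String reduce(String text, UnaryOperator<String> stemmer) {
        Set<String> unique = new LinkedHashSet<>();
        // Split on any run of characters that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            unique.add(stemmer.apply(token));
        }
        return String.join(" ", unique);
    }

    // Convenience overload that leaves the terms unchanged.
    public static String reduce(String text) {
        return reduce(text, UnaryOperator.identity());
    }

    public static void main(String[] args) {
        String in = "The application is deployed, and the application is running.";
        System.out.println(reduce(in));  // prints "application deployed running"
    }
}
```

The stemmer argument is where a word-shortening step (such as the
'application' -> 'applic' rewrite) would plug in; the sketch deliberately
leaves that step out rather than guess at its rules. Nothing is written to
the filesystem, so the returned string could go straight into a database
column.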
thanks,
rob
On Thu, 14 Mar 2002, Peter Carlson wrote:
> Hi
> I am a little confused by your request.
>
> When you say "get the text that lucene would normally put into the index",
> that doesn't really make sense, since lucene is term based.
>
> What data are you trying to get? The set of unique terms for each document?
> If you are trying to use lucene to normalize the data, you probably want to
> look into the analyzer.
>
> --Peter
>
>
> On 3/14/02 11:20 AM, "Robert A. Decker" <[EMAIL PROTECTED]> wrote:
>
> > I would like to use a very small part of the functionality of Lucene, but
> > need some pointers on which classes I should start looking at first.
> >
> > What I want to do is pass to a Lucene method some text, and have it return
> > the text that it would normally put into the index.
> >
> > (I'll then take that text and put it into an Oracle table)
> >
> > Really, that's all I want to do for now - I don't want anything to be
> > written to the filesystem. Just want to pass in text and get the modified
> > text back (assuming repetitive and common words have been removed).
> >
> > Can someone give me any pointers to get started? I've been slogging
> > through the code and the demos, but there's a lot there.
> >
> > thanks,
> > rob
> >
> >
> >
> > --
> > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> >
> >