Erik, thank you very much for your help! I'm not in a position to build the indexing yet (other features are in line before it), but I will try Lucene for it. Looks very good :)
What I did not ask last time, because it only just occurred to me: my application has a metric of "fit" that measures how well a term matches a document, and also how well it matches the whole collection. That seems like a good candidate for your "boost" value, then?

Cheers,
Daniel

Erik Hatcher wrote:
> On Apr 26, 2005, at 3:21 AM, Daniel Stephan wrote:
>
>> lets see if somebody listens on this list :-D
>
> I doubt many are on this list yet. But your question is probably best asked on the [EMAIL PROTECTED] list rather than here. I'll CC java-user this time to loop those folks in.
>
>> I wonder if the following is possible with Lucene.
>
> Yes, it is!
>
>> I would like to add documents to the index which aren't real documents.
>
> That's pretty much how my use of Lucene works - there aren't real filesystem documents to index per se.
>
>> :-) Meaning: there is no text to parse and tokenize. What I have is a number of features; some are simple words, some are combinations of words. Those features classify an entity in my database.
>>
>> I also have my own parser/analyzer/tokenizer which is able to take a text and extract those features from it. (Possibly a query.)
>>
>> So, I want to do something like (pseudo code):
>>
>> Lucene.index(myEntity.getId, myEntity.getDescriptors)
>>
>> and then, when a query is issued:
>>
>> List entityIds = Lucene.query(myQuery.convertToLuceneQueryLanguage)
>>
>> I was looking at the source and couldn't find a way to skip the analyzing stage and hand the features to Lucene myself.
>
> You have two good options here... you can add each token individually as a Field.Keyword() with the same field name - these will not get analyzed.
>
>> One possibility would be to use an analyzer which only considers whitespace as a delimiter and to set all descriptors as one string. This feels suboptimal, because I already have the features as single tokens, and concatenating them first, only to let Lucene tokenize them again, should not be necessary.
>
> You're right, it's not necessary. The second option is to create a custom Analyzer that returns the tokens you've already established.
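To make Erik's first option concrete, here is a rough sketch against the Lucene 1.x API. The index path, field names, and the entity accessors are made up for illustration, and the boost line is only my guess at where the "fit" metric from the top of this message could go:

    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FeatureIndexer {
        // Option 1: add each pre-extracted feature as a Field.Keyword()
        // with the same field name. Keyword fields are stored and indexed
        // but never run through an analyzer, so each feature goes into
        // the index exactly as-is.
        public static void index(String indexDir, String id, String[] features)
                throws IOException {
            // true = create a fresh index; an analyzer is required by the
            // constructor but is never consulted for keyword fields
            IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("id", id));
            for (int i = 0; i < features.length; i++) {
                doc.add(Field.Keyword("features", features[i]));
            }
            // Hypothetical: a per-document "fit" score could be applied
            // as an index-time boost, e.g. doc.setBoost(fitScore);
            writer.addDocument(doc);
            writer.close();
        }
    }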
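And a sketch of the second option, an Analyzer that simply hands Lucene the tokens you have already established (the class name and the source of the features array are hypothetical):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class PreTokenizedAnalyzer extends Analyzer {
        private final String[] features; // tokens extracted ahead of time

        public PreTokenizedAnalyzer(String[] features) {
            this.features = features;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // The Reader (the field's text) is ignored entirely; the
            // pre-established features are emitted as the token stream.
            return new TokenStream() {
                private int index = 0;

                public Token next() {
                    if (index == features.length) {
                        return null; // end of stream
                    }
                    String feature = features[index++];
                    // Offsets are synthetic since there is no source text;
                    // the position increment stays at its default of 1,
                    // as Erik recommends below.
                    return new Token(feature, 0, feature.length());
                }
            };
        }
    }

Since IndexWriter is constructed with a single analyzer, one way to vary the features per entity is the addDocument(Document, Analyzer) overload, e.g. writer.addDocument(doc, new PreTokenizedAnalyzer(entityFeatures)), together with a tokenized field (such as Field.UnStored) whose text content the analyzer then ignores.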
>> Also, the neighbour information isn't applicable in my scenario. It seems you use the placement of terms somehow. I don't have placement information. Would that hurt Lucene?
>
> Placement of terms is used in phrase queries, but it certainly isn't necessary for you to concern yourself with it. You can simply emit tokens in whatever position you like (leaving the position increment at its default of 1 is what I'd recommend).
>
>> I am not sure how Lucene uses the placement information, but in the described case, where I concatenate all my features into a whitespace-delimited text, I fear that Lucene uses the placement of features in this made-up text and comes to some wrong conclusions (after all, the placement in the "made-up" text is arbitrary).
>
> What wrong conclusions do you fear here? Again, the position information is used for phrase queries, but in your situation you wouldn't be using phrase queries, so there's no need to concern yourself with the position stuff at all.
>
>> Also, I am not sure yet what the converter would have to look like. After all, the terms in the query have to be of the same form as those in the index, otherwise they wouldn't match. Can I inject my own analyzer only for the query part, so that Lucene hands it phrases and lets it build features from those phrases?
>
> Sure - you can use any analyzer you like for query parsing. It sounds like you aren't going to use QueryParser, though, so you may not need an analyzer at all. You definitely have to ensure that the terms in the query match the terms you indexed in order to find documents. How you do this is really up to you.
>
>> Any info is appreciated. I could maybe build my own simple index - the analyzer is already there - but I would prefer to use a professional solution with a good query language and some additional nice-to-have features.
>
> If you could give us something more concrete, we could help in more detail. But from the scenario you've described, Lucene fits fine; in fact, what you describe is the way I use it in some cases.
>
>> May I use Lucene? :-)
>
> Yes, you may!
>
> Erik
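To round out the query side Erik mentions: since the features are already terms, a query can be assembled directly from TermQuerys instead of going through QueryParser. A rough sketch, again with a made-up index path and field names:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class FeatureSearcher {
        // Build the query directly from pre-extracted features, bypassing
        // QueryParser. The terms must match the indexed features exactly,
        // as Erik notes above.
        public static void search(String indexDir, String[] queryFeatures)
                throws Exception {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < queryFeatures.length; i++) {
                // (required=false, prohibited=false) makes each clause
                // optional: documents match any one feature and rank
                // higher the more features they share with the query.
                query.add(new TermQuery(new Term("features", queryFeatures[i])),
                          false, false);
            }
            IndexSearcher searcher = new IndexSearcher(indexDir);
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println(doc.get("id") + " scored " + hits.score(i));
            }
            searcher.close();
        }
    }

If the "fit" metric is also meaningful per query term, calling setBoost() on the individual TermQuerys would be the query-time counterpart of the index-time boost suggested earlier.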