Erik, thank you very much for your help! I'm not in a position to build the indexing yet (other features are in line before it), but I will try Lucene for it. Looks very good :)
What I did not ask last time, because it only just occurred to me: my application has a metric of "fit" that measures how well a term matches a document, and also how well it matches the whole collection. That seems like a good candidate for your "boost" value, then?

Cheers,
Daniel

Erik Hatcher wrote:
> On Apr 26, 2005, at 3:21 AM, Daniel Stephan wrote:
>
>> lets see if somebody listens on this list :-D
>
> I doubt many are on this list yet. But your question is probably best asked on the [EMAIL PROTECTED] list rather than here. I'll CC java-user this time to loop those folks in.
>
>> I wonder if the following is possible with Lucene.
>
> Yes, it is!
>
>> I would like to add documents to the index which aren't real documents.
>
> That's pretty much how my use of Lucene works - there aren't real filesystem documents to index per se.
>
>> :-) Meaning: there is no text to parse and tokenize. What I have is a number of features; some are simple words, some are combinations of words. Those features classify an entity in my database.
>>
>> I also have my own parser/analyzer/tokenizer which is able to take a text and extract those features from it. (Possibly a query.)
>>
>> So, I want to do something like (pseudo code):
>>
>> Lucene.index(myEntity.getId, myEntity.getDescriptors)
>>
>> and then, when a query is issued:
>>
>> List entityIds = Lucene.query(myQuery.convertToLuceneQueryLanguage)
>>
>> I was looking at the source and couldn't find a way to skip the analyzing stage and hand the features to Lucene myself.
>
> You have two good options here... you can add each token individually as a Field.Keyword() with the same field name - these will not get analyzed.
>
>> One possibility would be to use an analyzer which only considers whitespace as a delimiter and to set all descriptors as one string. This feels suboptimal, because I already have the features as single tokens, and concatenating them first, only to let Lucene tokenize them again, should not be necessary.
>
> You're right, it's not necessary. The second option is to create a custom Analyzer that returns the tokens you've already established.
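To make Erik's first option concrete, here is a rough sketch against the Lucene 1.x API. The index path, field names, and the entity accessors are made up for illustration, and the boost line is only my guess at where the "fit" metric from the top of this message could go:

    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FeatureIndexer {
        // Option 1: add each pre-extracted feature as a Field.Keyword()
        // with the same field name. Keyword fields are stored and indexed
        // but never run through an analyzer, so each feature goes into
        // the index exactly as-is.
        public static void index(String indexDir, String id, String[] features)
                throws IOException {
            // true = create a fresh index; an analyzer is required by the
            // constructor but is never consulted for keyword fields
            IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("id", id));
            for (int i = 0; i < features.length; i++) {
                doc.add(Field.Keyword("features", features[i]));
            }
            // Hypothetical: a per-document "fit" score could be applied
            // as an index-time boost, e.g. doc.setBoost(fitScore);
            writer.addDocument(doc);
            writer.close();
        }
    }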
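And a sketch of the second option, an Analyzer that simply hands Lucene the tokens you have already established (the class name and the source of the features array are hypothetical):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class PreTokenizedAnalyzer extends Analyzer {
        private final String[] features; // tokens extracted ahead of time

        public PreTokenizedAnalyzer(String[] features) {
            this.features = features;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // The Reader (the field's text) is ignored entirely; the
            // pre-established features are emitted as the token stream.
            return new TokenStream() {
                private int index = 0;

                public Token next() {
                    if (index == features.length) {
                        return null; // end of stream
                    }
                    String feature = features[index++];
                    // Offsets are synthetic since there is no source text;
                    // the position increment stays at its default of 1,
                    // as Erik recommends below.
                    return new Token(feature, 0, feature.length());
                }
            };
        }
    }

Since IndexWriter is constructed with a single analyzer, one way to vary the features per entity is the addDocument(Document, Analyzer) overload, e.g. writer.addDocument(doc, new PreTokenizedAnalyzer(entityFeatures)), together with a tokenized field (such as Field.UnStored) whose text content the analyzer then ignores.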
>> Also, the neighbour information isn't applicable in my scenario. It seems you use the placement of terms somehow. I don't have placement information. Would that hurt Lucene?
>
> Placement of terms is used in phrase queries, but it certainly isn't necessary for you to concern yourself with it. You can simply emit tokens in whatever position you like (leaving the position increment at its default of 1 is what I'd recommend).
>
>> I am not sure how Lucene uses the placement information, but in the described case, where I concatenate all my features into a whitespace-delimited text, I fear that Lucene uses the placement of features in this made-up text and comes to some wrong conclusions (after all, the placement in the "made-up" text is arbitrary).
>
> What wrong conclusions do you fear here? Again, the position information is used for phrase queries, but in your situation you wouldn't be using phrase queries, so there's no need to concern yourself with the position stuff at all.
>
>> Also, I am not sure yet what the converter would have to look like. After all, the terms in the query have to be of the same form as those in the index, otherwise they wouldn't match. Can I inject my own analyzer only for the query part, so that Lucene hands it phrases and lets it build features from those phrases?
>
> Sure - you can use any analyzer you like for query parsing. It sounds like you aren't going to use QueryParser, though, so you may not need an analyzer at all. You definitely have to ensure that the terms in the query match the terms you indexed in order to find documents. How you do this is really up to you.
>
>> Any info is appreciated. I could maybe build my own simple index - the analyzer is already there - but I would prefer to use a professional solution with a good query language and some additional nice-to-have features.
>
> If you could give us something more concrete, we could help in more detail. But from the scenario you've described, Lucene fits fine; in fact, what you describe is the way I use it in some cases.
>
>> May I use Lucene? :-)
>
> Yes, you may!
>
> Erik
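To round out the query side Erik mentions: since the features are already terms, a query can be assembled directly from TermQuerys instead of going through QueryParser. A rough sketch, again with a made-up index path and field names:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class FeatureSearcher {
        // Build the query directly from pre-extracted features, bypassing
        // QueryParser. The terms must match the indexed features exactly,
        // as Erik notes above.
        public static void search(String indexDir, String[] queryFeatures)
                throws Exception {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < queryFeatures.length; i++) {
                // (required=false, prohibited=false) makes each clause
                // optional: documents match any one feature and rank
                // higher the more features they share with the query.
                query.add(new TermQuery(new Term("features", queryFeatures[i])),
                          false, false);
            }
            IndexSearcher searcher = new IndexSearcher(indexDir);
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println(doc.get("id") + " scored " + hits.score(i));
            }
            searcher.close();
        }
    }

If the "fit" metric is also meaningful per query term, calling setBoost() on the individual TermQuerys would be the query-time counterpart of the index-time boost suggested earlier.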