How should this be done (the translation, that is)? If it were left as '<' and '>', would Lucene parse it properly?
Terry

----- Original Message -----
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 21, 2002 5:40 PM
Subject: Re: Tags Screwing up Searches

> Thanks for the update.
> This all sounds right (no bugs). The problem is the code that you have
> that translates those < and > characters.
>
> Otis
>
> --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > I discovered that the actual text I was dealing with had already
> > converted the '<' to '&lt;', and so forth. So the problem is that with
> > something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes
> > the trailing semi-colon ';' as a word separator, so it can find the
> > term 'college', but it does not see the ending of 'soccer'. I did
> > confirm that it *will* match on 'soccer&lt' just fine.
> >
> > I've proceeded to add a string substitution method which replaces
> > '&lt;' with '    ' (four spaces, in order to hopefully keep the
> > offsets straight). It appears to work, though I believe it slows down
> > the indexing.
> >
> > I don't know enough about the inner design of Lucene to figure this
> > out, but it seems logical that there would be a much more efficient
> > way to handle this than string operations.
> >
> > Anyway, thought I'd bring you up to date.
> >
> > Regards,
> >
> > Terry
> >
> > PS: I've had no responses from the list, so perhaps this is a unique
> > problem and doesn't justify a formal fix effort.
> >
> > ----- Original Message -----
> > From: "Terry Steichen" <[EMAIL PROTECTED]>
> > To: "Lucene Users Group" <[EMAIL PROTECTED]>
> > Sent: Friday, October 18, 2002 11:39 AM
> > Subject: Tags Screwing up Searches
> >
> > Some content I'm indexing contains certain HTML tags, like <p>, <b>,
> > <i>, etc. What I find is that when a term I'm searching for touches
> > one of these tags (which is fairly typical), the term isn't
> > recognized and the search fails. For example, <b>College Soccer</b>
> > doesn't match on either "college" or "soccer". I seem to recall
> > someone else bringing up a similar problem with a word that ends a
> > sentence (and is thus treated as if the period were part of the
> > word), but I don't recall what the response was and I can't find that
> > thread.
> >
> > Does anyone have some ideas on what's the best way to handle this?
> > Filter out the tags in the process of creating the Document for
> > indexing? Or through a modification to the Analyzer (I'm using the
> > StandardAnalyzer)? Or something else?
> >
> > TIA,
> >
> > Terry

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
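
For anyone finding this thread later, here is a rough sketch of the kind of substitution described above: blank out each tag, escaped or literal, with the same number of spaces before the text reaches the analyzer, so character offsets stay where they were. It uses only the JDK and no Lucene API. The class name MarkupBlanker and the method blankMarkup are made up for illustration, and the pattern is an assumption about what the tags look like; it will not handle every kind of markup (for example, attribute values that contain '&').

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Lucene. */
final class MarkupBlanker {

    // Assumed tag shapes: an escaped tag such as "&lt;b&gt;" or "&lt;/b&gt;",
    // or a literal tag such as "<b>" or "</b>".
    private static final Pattern TAG = Pattern.compile(
            "&lt;/?[A-Za-z][^&<>]*&gt;|</?[A-Za-z][^<>]*>");

    /** Replaces every matched tag with spaces of equal length. */
    static String blankMarkup(String text) {
        Matcher m = TAG.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(text, last, m.start());
            // Pad with one space per character of the tag to keep offsets.
            for (int i = m.start(); i < m.end(); i++) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String cleaned = blankMarkup(raw);
        System.out.println("[" + cleaned + "]");
        // The cleaned string has the same length as the input.
        System.out.println(cleaned.length() == raw.length());
    }
}
```

The cleaned string would then be handed to the Document field in place of the raw text. A stray '&lt;' that is not part of a tag (as in "a &lt; b") is left alone by this pattern, so whether that also needs translating depends on the content being indexed.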
