I will try Lucene rather than hand-coding, in order to keep features such as stemming, stop-word removal, fuzzy queries, and some European-language support. It may also be a starting point for using it more later on (I suspect this will happen if the initial try-out goes well).
I was expecting the highlight package (org.apache.lucene.search.highlight) to be included in the jar file, but I don't see it (this is with 2.4.1, the current version). Is it in a separate download somewhere? It includes the classes Highlighter and QueryScorer, among others (all listed at http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/package-summary.html).

Ted Dunning wrote:
>
> You could turn this inside out and get the result you want, I think.
>
> If you index each document with separate Lucene fields for each document,
> then you can start with a search for all documents that have the text you
> want to find in the fields you care about. Then, all of the documents that
> you have will have the text you want.
>
> Alternately, if you have one or more documents and want to find out whether
> they have matches against particular fields, you can combine a search for
> the strings you want in the fields you desire with a filter that limits the
> search to the documents in question.
>
> Mostly, however, it sounds like what you need is a bit different from what
> Lucene is intended to provide which is the ability to search a gazillion
> documents for text relevant to a pretty fuzzy query. With only one document
> and only a few fields to search, you might be just as well off coding the
> search explicitly. Lucene could still serve as a nice substrate for
> document storage.
>
> On Mon, May 11, 2009 at 9:49 AM, apgw <anth...@databaserepublic.com> wrote:
>
>>
>> The documents are text fields in a db of legal docs, so the search is not for
>> the document but for the search string(s) (there will be multiple) within a
>> given document. The search strings are manually derived from the main part,
>> and they would like to match these in the law's various sub-sections
>> automatically (legal docs - long, tedious...).
>>
>> I have more detailed questions (?should they be indexed when saved, or if
>> this can be done quickly enough when the page is requested), and so on, but
>> this is probably not the right forum; just need to know if Lucene will do
>> it. I see there is a new edition of the 'Lucene in Action' almost ready; I
>> have the first ed coming in the mail which I hope will help.
>>
>>
>> Ted Dunning wrote:
>> >
>> > Yes. This can be done using Lucene.
>> >
>> > But, this is subject to a few liberal interpretations of what you asked
>> > for. To wit, I am assuming that you want to find interesting documents from
>> > a bunch of documents, not just search a single document for matches.
>> >
>> > The span queries that another poster mentioned would be good as would
>> > sloppy phrase queries.
>> >
>> > Depending on which European languages you need to handle, there may be some
>> > work you need to do to deal with morphological analysis. Lucene has
>> > reasonable support for English and somewhat more rudimentary support for a
>> > few other European languages. Support for Asian languages is very basic at
>> > best.
>> >
>> > On Sun, May 10, 2009 at 7:43 PM, apgw <anth...@databaserepublic.com> wrote:
>> >
>> >>
>> >> I am new to Lucene. Is this the right utility to use for the following use
>> >> case:
>> >>
>> >> 1) Find a search term - eg. 'lithium battery' in some technical rich-text
>> >> data (can be in any european language), 4K - 64K size, and return the exact
>> >> position in the text so that the occurrence can be turned into a hyperlink
>> >> within the text, and the full text returned to the user with the embedded
>> >> hyperlinks which he can select if he is interested.
>> >>
>> >> 2) Also find and hyperlink "lithium batteries", or "lithium hydride
>> >> batteries" (with lower ranking) and so on.
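
For comparison, the hand-coded route Ted mentions for a single document might look roughly like the sketch below, covering use case 1 only. It uses plain java.util.regex rather than Lucene, so it does no stemming, stop-word removal or fuzzy matching, and it would not pick up "lithium batteries" from use case 2. TermLinker, Span, find, link and the "#li" link target are names made up for illustration, not anything from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermLinker {

    /** Character offsets of one match: start is inclusive, end is exclusive. */
    static final class Span {
        public final int start;
        public final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Finds case-insensitive, whole-word occurrences of the literal phrase. */
    static List<Span> find(String text, String phrase) {
        // Pattern.quote treats the phrase as literal text, not as a regex.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        List<Span> spans = new ArrayList<Span>();
        while (m.find()) {
            spans.add(new Span(m.start(), m.end()));
        }
        return spans;
    }

    /** Returns the full text with each span wrapped in an anchor tag. */
    static String link(String text, List<Span> spans, String href) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) {
            out.append(text, last, s.start);              // text before the match
            out.append("<a href=\"").append(href).append("\">");
            out.append(text, s.start, s.end);             // the match itself
            out.append("</a>");
            last = s.end;
        }
        out.append(text, last, text.length());            // remaining tail
        return out.toString();
    }
}
```

Calling TermLinker.find(text, "lithium battery") gives the offsets of each exact occurrence, and TermLinker.link(text, spans, "#li") returns the whole text with those occurrences turned into links. Matching variants such as plurals is where an analyzer with stemming would come in, which is the reason for trying Lucene in the first place.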

--
View this message in context: http://www.nabble.com/Lucene-rich-text-search-with-returned-hyperlinks-tp23476377p23676291.html
Sent from the Lucene - General mailing list archive at Nabble.com.