The BoW approach is simple and highly effective IMO. If you want to get a bit fancy, you could also use a MultiField query in the combined index.
Another brute-force approach would be to hit all 3 indexes and see which ones come back with the highest score(s). On Mon, Mar 9, 2009 at 8:43 AM, Erick Erickson <erickerick...@gmail.com>wrote: > Sure, Lucene is suited. If.... > > The central problem here isn't the search engine, IMO, it's > figuring out what bits of the query are relevant to what > parts of the data. That is, in some random string, what is > the street, business name, address, etc. > > Lucene has nothing built in that I know of that'll help with > that part. Once you *have* figured out what parts of the > query relate to what fields in your index, the rest is easy. > But you'll have to do the figuring out yourself. > > But you might try the bagowords I suggested before as > a shortcut and see what kind of results you get. Sometimes > simplistic solutions are "good enough", but that's always > up to you to decide once you start seeing results. > > Best > Erick > > On Mon, Mar 9, 2009 at 4:31 AM, Srinivas Bharghav > <srini.bharg...@gmail.com>wrote: > > > Thanks for all the inputs guys. > > > > As Erick said let me elaborate the problem a bit. > > > > We are trying to develop a local search application. The user will be > able > > to locate businesses, localities and roads. We have data for all the 3 > with > > us. We do not want to provide separate boxes for the user to enter data > i.e > > a common one for all entry box (a la google :)) where the user enters an > > address (or road name or area name) or all the 3 etc etc. From the user > > query we have to find the best possible match in our data. The data has > > lots > > of numbers as well as names with initials and stuff like that. The user > may > > enter the names with a space between the initals or they might club the > > initials together etc etc. From the user query we do not have a way to > > figure out what is what apart from the obvious ones as to if something > ends > > with a road then it is a road name or if there is a layout in the query > > then > > it is an area etc. Right now we have our own custom framework. I am > trying > > to figure out as to whether Lucene is suited for this kind of > application. > > > > Once again thanks for all the inputs. > > > > On Fri, Mar 6, 2009 at 7:15 PM, Erick Erickson <erickerick...@gmail.com > > >wrote: > > > > > Whatever you do will be wrong <G>. What you're saying is > > > that you have structured data that the user wants to search > > > in an unstructured way, and you want to try to create a > > > system that intuits what the user meant. Good luck <G>. > > > > > > Can you back up a bit and talk about the problem you're > > > trying to solve? If, for instance, you're trying to find the > > > best match for a particular business, one approach would > > > be to create one index where each business had > > > > > > street > > > business > > > area > > > bagowords > > > > > > where the field bagowords contained a copy of the data > > > from the other three fields, then search bagowords > > > for your query terms. It sounds simplistic, but it might be > > > surprisingly good. > > > > > > And if this is out in left field, a higher level statement > > > of the problem would help get better answers. > > > > > > Best > > > Erick > > > > > > On Fri, Mar 6, 2009 at 1:25 AM, Srinivas Bharghav > > > <srini.bharg...@gmail.com>wrote: > > > > > > > I am trying to evaluate as to whether Lucene is the right candidate > for > > > the > > > > problem at hand. > > > > > > > > Say I have 3 indexes: > > > > > > > > Index 1 has street names. > > > > Index 2 has business names. > > > > Index 3 has area names. > > > > > > > > All these names can be single words or a combination of words like > > > woodward > > > > street or marks and spencers street etc etc. > > > > > > > > Now the use enters a query saying "mc donalds woodward street > kingston > > > > precinct". > > > > > > > > I have to parse this query and come up with the best match possible. > > The > > > > problem is, in the query I do not know which part is the business > name > > or > > > > area name or street name. Also the user may give the query in any > order > > > for > > > > example he may give it as "kingston precinct mc donalds woodward > > street". > > > > There might be spelling mistkaes in the query enterd by the user. > Also > > he > > > > might use road for street or lane for street and such things. I know > > that > > > > Lucene is the right candidate for the synonym and spelling mistakes > > part > > > > but > > > > am a bit hazy regarding the user query parsing part as to in which > > index > > > to > > > > search what. Any help is greatly appreciated. > > > > > > > > Thanks, > > > > Srini. > > > > > > > > > >