Happy new year! I'm working on a simple way to geocode documents as they are indexed, and I'm hoping to reuse existing Lucene infrastructure as much as possible. My plan is to build an index of known place names, then look for matches in incoming text. When there is a match, some extra fields will get added to the index.
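To make the matching step concrete, here is a minimal sketch of the normalization half of that plan, using only plain Java (the class and method names are hypothetical, and the tokenizer is a crude stand-in for a real Lucene Analyzer): each known place name is normalized with the same analysis chain that will later be applied to incoming text, and the tokens are joined with a separator character that cannot appear in any token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: normalizes place names the same way incoming
// text will be normalized, so exact matching works after analysis.
public class PlaceNameNormalizer {

    // Crude stand-in for an Analyzer chain: lowercase, strip
    // punctuation, split on whitespace.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            String cleaned = t.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    // Join tokens with a character that can never occur inside a
    // token, so multi-word names form a single unambiguous key.
    public static String normalize(String phrase) {
        return String.join("\u0000", tokenize(phrase));
    }
}
```

The important property is that both sides (the known place list and the incoming documents) pass through the identical chain, so "The People's Republic of China" and "the peoples republic of china" produce the same key.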
The known place list will include things like:

* The People's Republic of China
* Rome
* New York

I want to match documents where one of these phrases (normalized for capitalization/punctuation/etc.) appears in the document.

It looks like MemoryIndex was made to do something like this: create a MemoryIndex for each item you want to match, then run the document against each possible value and see if it matches. Without testing this approach, it seems kinda crazy if we have ~100K+ place names. I am also concerned about how this would work with long phrases and things that may match "The Peoples Republic of *".

Just brainstorming, it seems like an FST could be a good/efficient way to match documents. My plan would be to:

1. Use an Analyzer to create a TokenStream for each place name. From the TokenStream, create an FST<docid> -- this would have to pick some impossible character for the token separator.
2. While indexing, create a TokenStream from the input text. For each token, try to follow the Arc to a match. If there is a match, add it to the document.

Does this approach seem reasonable? Is there some standard way to do this that I am missing?

thanks for any pointers!

ryan

The two approaches I am considering:

1. MemoryIndex -- build a MemoryIndex for each place name, then check the document against every index.
2. FST -- Use an Analyzer to get a TokenStream for each place name and build an FST<docid> from the tokens. Then analyze the text while indexing and use its TokenStream to walk the FST arcs looking for matches.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
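As a rough illustration of the FST approach's matching step, here is a plain-Java sketch in which a simple trie stands in for Lucene's FST (all names here are hypothetical, and a real implementation would use Lucene's FST classes instead): the document's token stream is walked token by token, following arcs as far as they lead and recording the longest complete place name found at each position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical matcher: a trie of token sequences, standing in for an
// FST<docid>, walked against an analyzed document's tokens.
public class PlaceMatcher {

    static class Node {
        Map<String, Node> arcs = new HashMap<>();
        Integer placeId; // non-null when a complete place name ends here
    }

    private final Node root = new Node();

    // Register one analyzed place name under a numeric id.
    public void add(List<String> tokens, int placeId) {
        Node n = root;
        for (String t : tokens) {
            n = n.arcs.computeIfAbsent(t, k -> new Node());
        }
        n.placeId = placeId;
    }

    // For each position in the document's token stream, follow arcs as
    // far as they match and keep the longest completed place name.
    public List<Integer> match(List<String> docTokens) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docTokens.size(); i++) {
            Node n = root;
            Integer longest = null;
            int end = i;
            for (int j = i; j < docTokens.size(); j++) {
                n = n.arcs.get(docTokens.get(j));
                if (n == null) {
                    break;
                }
                if (n.placeId != null) {
                    longest = n.placeId;
                    end = j;
                }
            }
            if (longest != null) {
                hits.add(longest);
                i = end; // skip past the matched phrase
            }
        }
        return hits;
    }
}
```

The longest-match rule is what keeps "New York" from also firing a hypothetical shorter entry like "New"; the ids returned here would become the extra geo fields added to the document at index time.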