"AA-" indexed as a StringField was matched by a TermQuery for "AA"? Sounds surprising.
On Tue, Feb 12, 2013 at 6:32 PM, Mohammad Tariq <donta...@gmail.com> wrote:

Thanks again Ian. I'll make the changes you suggested. I am using the dots because when I searched for 'AA' it was giving me 'AA-' as well.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Tue, Feb 12, 2013 at 9:50 PM, Ian Lea <ian....@gmail.com> wrote:

At a glance it looks fine. I don't see what you gain by adding dots - you are using a TermQuery, which will only do exact matches. Since you're using StringField your text won't be tokenized but stored as is. I see you're searching on a mixed-case term - that's fine as long as you don't expect "aaa" to match "AAA". I tend to just downcase everything because I've wasted so much time over the years on silly case-sensitive bugs.

RAMDirectory instances will disappear when the application ends, so yes, you'll need to reload on startup. You don't have to recreate it for each search though - create and populate the RAMDirectory on startup, create an IndexSearcher, and use that for all searches.

Depending on your app it might be easier to use a normal disk-based index. It will probably be fast enough.

--
Ian.
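To illustrate that suggestion, a rough sketch, assuming Lucene 4.0, of building the RAMDirectory once at startup, downcasing terms at index and query time, and reusing a single IndexSearcher for every lookup. The class name, field names, and row layout are made up for the example, and the HBase scan is left out:

import java.io.IOException;
import java.util.Locale;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

/** Built once at application startup; lookup() is then reused for every search. */
public class LocalNameIndex {

    private final Directory dir = new RAMDirectory();
    private IndexSearcher searcher;

    /** Call once at startup, passing whatever rows were read from HBase. */
    public void build(Iterable<String[]> rows) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter w = new IndexWriter(dir, cfg);
        for (String[] row : rows) {   // row = {localname, controlname, controlid}
            Document doc = new Document();
            // downcase at index time so lookups are case-insensitive
            doc.add(new StringField("localname",
                    row[0].toLowerCase(Locale.ROOT), Field.Store.YES));
            doc.add(new StringField("controlname", row[1], Field.Store.YES));
            doc.add(new StringField("controlid", row[2], Field.Store.YES));
            w.addDocument(doc);
        }
        w.close();
        // one searcher, reused for all subsequent queries
        searcher = new IndexSearcher(DirectoryReader.open(dir));
    }

    /** Exact lookup; downcase the query term the same way as at index time. */
    public Document lookup(String localName) throws IOException {
        TermQuery q = new TermQuery(
                new Term("localname", localName.toLowerCase(Locale.ROOT)));
        ScoreDoc[] hits = searcher.search(q, 1).scoreDocs;
        return hits.length == 0 ? null : searcher.doc(hits[0].doc);
    }
}

The point is that build() runs once when the application starts, and lookup() can then be called for every term without recreating the index or reopening anything.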
On Tue, Feb 12, 2013 at 1:29 PM, Mohammad Tariq <donta...@gmail.com> wrote:

Hello Ian,

I started as directed by you and created the index. Here is a small piece of code which I have written. Please have a look over it:

public static void main(String[] args) throws IOException, ParseException {

    // Specify the analyzer for tokenizing text. The same analyzer should
    // be used for indexing and searching
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

    // 1. create the index
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mappings");
    Scan s = new Scan();
    ResultScanner rs = table.getScanner(s);
    int count = 0;
    String[] localnames;
    for (Result r : rs) {
        count++;
        localnames = Bytes.toString(r.getValue(Bytes.toBytes("cf"),
                Bytes.toBytes("LOC"))).trim().split(",");
        for (String str : localnames) {
            addDoc(w, "." + str + ".",
                    Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON"))),
                    Bytes.toString(r.getRow()));
        }
    }
    System.out.println("COUNT : " + count);
    table.close();
    w.close();

    // 2. query
    String term = "";
    // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    // System.out.println("Enter the term you want to search...");
    // term = br.readLine();
    term = "Vacuolated Lymphocytes";
    TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));

    // 3. search
    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(tq, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println("ControlID -> " + d.get("controlid") + "\t"
                + "Localnames -> " + d.get("localname") + "\t"
                + "Controlname -> " + d.get("controlname"));
    }
    // the reader can only be closed when there is no need
    // to access the documents any more
    reader.close();
}

private static void addDoc(IndexWriter w, String local, String control, String rowkey)
        throws IOException {
    Document doc = new Document();
    doc.add(new StringField("localname", local, Field.Store.YES));
    doc.add(new StringField("controlname", control, Field.Store.YES));
    doc.add(new StringField("controlid", rowkey, Field.Store.YES));
    w.addDocument(doc);
}

Does it look fine to you? Or can I make it better by adding or removing something? Although it shows only a primitive usage of Lucene, it is always better to have some able guidance.

One more question: does the index remain alive only for the lifetime of the application if we are using RAMDirectory? I have to run the entire process every time I want to search something.

Also, I have added a dot (.) before and after each word before adding it to the document so that I can do an exact-match search. Is my approach correct, or is there some other out-of-the-box feature available in Lucene which I can use for this?

I am sorry to be a pest of questions, and thank you so much for your time.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <donta...@gmail.com> wrote:

Hey Ian. Thank you so much for the quick reply. I'll definitely give Lucene a shot. I'll start off with it and get back to you in case of any problem.

Many thanks.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <ian....@gmail.com> wrote:

You can certainly use lucene for this, and it will be blindingly fast even if you use a disk-based index.

Just index documents as you've laid it out, with the field you want to search on added as indexable and the others stored.

I've never used Guava Table so can't comment on that, but with only a few thousand words it would certainly be feasible to use something like that. Better? I don't know.

Personally I'd probably go with lucene, as I'd be positive it would a) work and b) be fast even if the thousands ended up being tens of thousands, or more.

--
Ian.
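As one concrete reading of "the field you want to search on added as indexable and the others stored", a small sketch, again assuming Lucene 4.0 and illustrative field names: one document per synonym, with only the synonym indexed (via StringField) and the base word and ID carried along as stored-only fields (StoredField):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

public class SynonymDocs {
    /** One document per synonym: the synonym is searchable, word and id are only stored. */
    static void addSynonym(IndexWriter w, String synonym, String word, String id)
            throws IOException {
        Document doc = new Document();
        // indexed (and stored) - this is the field the TermQuery runs against
        doc.add(new StringField("synonym", synonym, Field.Store.YES));
        // stored only - returned with a hit but not searchable
        doc.add(new StoredField("word", word));
        doc.add(new StoredField("id", id));
        w.addDocument(doc);
    }
}

A TermQuery on "synonym" then finds the document, and IndexSearcher.doc() gives back the stored "word" and "id" values.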
On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <donta...@gmail.com> wrote:

Hello list,

I have a scenario wherein I need an in-memory index, as I need faster search. The problem goes like this:

I have a list which contains a couple of thousand words. Each word has a corresponding ID and a list of synonyms. The actual word is a column in my HBase table. I get files which contain values for this column, and I have to extract values from these files and put them into the appropriate column. But sometimes the files may contain a synonym instead of the actual word. This is where the index comes into the picture. I should have an index that contains all the words along with their IDs and all the synonyms, and it should always be in memory so that inserts into HBase are quick. Something like this:

ID       WORD    SYNONYMS
13991    A       a, A, Aa, aa, AA

Then the index should be something like this:

a     A    13991
A     A    13991
Aa    A    13991
aa    A    13991
AA    A    13991

So that if I get 'a' in the file, I should be able to do a lookup and the index should give me 'A' along with '13991'. I need both the base name and the ID. The names could even be strings of 4 to 5 words.

I have all this information stored in an HBase table with two columns, where the first column contains the actual word and the second column contains the entire list of synonyms. The rowkey is the ID.

Now I am not sure whether it is feasible to use Lucene for this, or whether I should go with something like a Guava Table or something else. I need some guidance, as being new to Lucene I am not able to think in the right direction. If it is feasible to use Lucene to achieve this, how do I do it efficiently?

I am using HBase filters right now to do the fetch, which is slowing down the process.

I am sorry if my questions sound too childish or senseless, as I am not very good at Lucene. Thank you so much for your valuable time.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
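For completeness, a sketch of the lookup side against that layout, assuming Lucene 4.0 and the illustrative "synonym"/"word"/"id" field names from the sketch above: given 'a' it should return 'A' and '13991'.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class SynonymLookup {

    private final IndexSearcher searcher;

    public SynonymLookup(Directory dir) throws IOException {
        // one searcher, opened once and reused while files are being processed
        this.searcher = new IndexSearcher(DirectoryReader.open(dir));
    }

    /** Returns {word, id} for a synonym, e.g. "a" -> {"A", "13991"}, or null if unknown. */
    public String[] resolve(String synonym) throws IOException {
        ScoreDoc[] hits = searcher.search(
                new TermQuery(new Term("synonym", synonym)), 1).scoreDocs;
        if (hits.length == 0) {
            return null;
        }
        Document d = searcher.doc(hits[0].doc);
        return new String[] { d.get("word"), d.get("id") };
    }
}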