Thanks again Ian. I'll make the changes you suggested. I am using the dots because when I searched for 'AA' it was matching 'AA-' as well.
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 12, 2013 at 9:50 PM, Ian Lea <ian....@gmail.com> wrote:

> From a glance it looks fine. I don't see what you gain by adding dots -
> you are using a TermQuery which will only do exact matches. Since you're
> using StringField your text won't be tokenized but stored as is. I see
> you're searching on a mixed-case term - that's fine as long as you don't
> expect "aaa" to match "AAA". I tend to just downcase everything because
> I've wasted so much time over the years on silly case-sensitive bugs.
>
> RAMDirectory instances will disappear when the application ends, so yes,
> you'll need to reload on startup. You don't have to recreate for each
> search though - create and populate the RAMDirectory on startup, create
> an IndexSearcher, and use that for all searches.
>
> Depending on your app it might be easier to use a normal disk-based
> index. It will probably be fast enough.
>
> --
> Ian.
>
>
> On Tue, Feb 12, 2013 at 1:29 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> > Hello Ian,
> >
> > I started as directed by you and created the index. Here is a small
> > piece of code which I have written. Please have a look over it:
> >
> > public static void main(String[] args) throws IOException, ParseException {
> >     // Specify the analyzer for tokenizing text. The same analyzer
> >     // should be used for indexing and searching.
> >     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> >
> >     // 1. create the index
> >     Directory index = new RAMDirectory();
> >     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
> >     IndexWriter w = new IndexWriter(index, config);
> >     Configuration conf = HBaseConfiguration.create();
> >     HTable table = new HTable(conf, "mappings");
> >     Scan s = new Scan();
> >     ResultScanner rs = table.getScanner(s);
> >     int count = 0;
> >     String[] localnames;
> >     for (Result r : rs) {
> >         count++;
> >         localnames = Bytes.toString(r.getValue(Bytes.toBytes("cf"),
> >                 Bytes.toBytes("LOC"))).trim().split(",");
> >         for (String str : localnames) {
> >             addDoc(w, "." + str + ".",
> >                     Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON"))),
> >                     Bytes.toString(r.getRow()));
> >         }
> >     }
> >     System.out.println("COUNT : " + count);
> >     rs.close();    // close the scanner before closing the table
> >     table.close();
> >     w.close();
> >
> >     // 2. query
> >     String term = "";
> >     // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
> >     // System.out.println("Enter the term you want to search...");
> >     // term = br.readLine();
> >     term = "Vacuolated Lymphocytes";
> >     TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));
> >
> >     // 3. search
> >     int hitsPerPage = 10;
> >     IndexReader reader = DirectoryReader.open(index);
> >     IndexSearcher searcher = new IndexSearcher(reader);
> >     TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
> >     searcher.search(tq, collector);
> >     ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >
> >     // 4. display results
> >     System.out.println("Found " + hits.length + " hits.");
> >     for (int i = 0; i < hits.length; ++i) {
> >         int docId = hits[i].doc;
> >         Document d = searcher.doc(docId);
> >         System.out.println("ControlID -> " + d.get("controlid") + "\t"
> >                 + "Localnames -> " + d.get("localname") + "\t"
> >                 + "Controlname -> " + d.get("controlname"));
> >     }
> >     // The reader can only be closed when there is no need to access
> >     // the documents any more.
> >     reader.close();
> > }
> >
> > private static void addDoc(IndexWriter w, String local, String control,
> >         String rowkey) throws IOException {
> >     Document doc = new Document();
> >     doc.add(new StringField("localname", local, Field.Store.YES));
> >     doc.add(new StringField("controlname", control, Field.Store.YES));
> >     doc.add(new StringField("controlid", rowkey, Field.Store.YES));
> >     w.addDocument(doc);
> > }
> >
> > Does it look fine to you? Or can I make it better by adding or removing
> > something? Although it shows just a primitive usage of Lucene, it is
> > always better to have some able guidance.
> >
> > One more question: does the index stay alive only for the lifetime of
> > the application if we are using RAMDirectory? Do I have to run the
> > entire process every time I want to search something?
> >
> > Also, I have added a dot (.) before and after each word before adding
> > it to the document so that I can do an exact-match search. Is my
> > approach correct, or is there some other OOTB feature available in
> > Lucene which I can use for this?
> >
> > I am sorry to pester you with questions, and thank you so much for
> > your time.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> > cloudfront.blogspot.com
> >
> >
> > On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> >
> >> Hey Ian. Thank you so much for the quick reply. I'll definitely give
> >> Lucene a shot.
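Ian's "downcase everything" advice from the reply above can be sketched as a small normalization helper applied identically at index time and query time, so that StringField plus TermQuery matches exactly but case-insensitively, with no need for dot-padding. The class and method names below are illustrative, not from the thread, and the Lucene wiring is shown only in comments:

```java
import java.util.Locale;

public class ExactMatchNormalizer {

    // One normalization, used on BOTH sides. Trim stray whitespace, then
    // downcase with a fixed locale so the result doesn't depend on the
    // JVM's default locale.
    static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Index side (sketch):
        //   doc.add(new StringField("localname", normalize(str), Field.Store.YES));
        // Query side (sketch):
        //   new TermQuery(new Term("localname", normalize(term)));
        System.out.println(normalize("  Vacuolated Lymphocytes "));   // prints "vacuolated lymphocytes"
        System.out.println(normalize("VACUOLATED LYMPHOCYTES")
                .equals(normalize("Vacuolated Lymphocytes")));        // prints "true"
    }
}
```

Because the same function runs over stored values and query terms, "aaa" and "AAA" land on the same indexed term, which is exactly the class of bug Ian describes.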
> >> I'll start off with it and get back to you in case of any problem.
> >>
> >> Many thanks.
> >>
> >> Warm Regards,
> >> Tariq
> >> https://mtariq.jux.com/
> >> cloudfront.blogspot.com
> >>
> >>
> >> On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <ian....@gmail.com> wrote:
> >>
> >>> You can certainly use Lucene for this, and it will be blindingly fast
> >>> even if you use a disk-based index.
> >>>
> >>> Just index documents as you've laid it out, with the field you want
> >>> to search on added as indexable and the others stored.
> >>>
> >>> I've never used Guava's Table so can't comment on that, but with only
> >>> a few thousand words it would certainly be feasible to use something
> >>> like that. Better? I don't know.
> >>>
> >>> Personally I'd probably go with Lucene as I'd be positive it would a)
> >>> work and b) be fast even if the thousands ended up being tens of
> >>> thousands, or more.
> >>>
> >>> --
> >>> Ian.
> >>>
> >>> On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> >>> > Hello list,
> >>> >
> >>> > I have a scenario wherein I need an in-memory index for faster
> >>> > search. The problem goes like this:
> >>> >
> >>> > I have a list which contains a couple of thousand words. Each word
> >>> > has a corresponding ID and a list of synonyms. The actual word is a
> >>> > column in my HBase table. I get files which contain values for this
> >>> > column, and I have to extract the values from these files and put
> >>> > them into the appropriate column. But sometimes the files may
> >>> > contain a synonym instead of the actual word. This is where the
> >>> > index comes into the picture: I need an index that contains all the
> >>> > words along with their IDs and all the synonyms, and it should
> >>> > always be in-memory so that inserts into HBase are quick.
> >>> > Something like this:
> >>> >
> >>> > ID     WORD   SYNONYMS
> >>> > 13991  A      a, A, Aa, aa, AA
> >>> >
> >>> > Then the index should be something like this:
> >>> >
> >>> > a    A   13991
> >>> > A    A   13991
> >>> > Aa   A   13991
> >>> > aa   A   13991
> >>> > AA   A   13991
> >>> >
> >>> > So that if I get 'a' in a file, I should be able to do a lookup and
> >>> > the index should give me 'A' along with '13991'. I need both the
> >>> > base name and the ID. The names could even be strings of 4 to 5
> >>> > words.
> >>> >
> >>> > I have all this information stored in an HBase table with two
> >>> > columns, where the first column contains the actual word and the
> >>> > second column contains the entire list of synonyms. The rowkey is
> >>> > the ID.
> >>> >
> >>> > Now, I am not sure whether it is feasible to use Lucene for this,
> >>> > or whether I should go with something like Guava's Table or
> >>> > something else. I need some guidance: being new to Lucene, I am not
> >>> > able to think in the right direction. If it is feasible to use
> >>> > Lucene to achieve this, how do I do it efficiently?
> >>> >
> >>> > I am using HBase filters right now to do the fetch, which is
> >>> > slowing down the process.
> >>> >
> >>> > I am sorry if my questions sound too childish or senseless, as I am
> >>> > not very good at Lucene. Thank you so much for your valuable time.
> >>> >
> >>> > Warm Regards,
> >>> > Tariq
> >>> > https://mtariq.jux.com/
> >>> > cloudfront.blogspot.com
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
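The non-Lucene alternative discussed in the thread (the "Guava Table or something else" option) can be sketched as a plain in-memory map from each synonym to its base word and ID: for a couple of thousand rows this fits easily in memory and gives O(1) lookups. The class and method names below are made up for illustration; the sample row is the one from the example table above:

```java
import java.util.HashMap;
import java.util.Map;

public class SynonymLookup {

    // synonym -> { base word, id }
    private final Map<String, String[]> index = new HashMap<>();

    // Register one row: every synonym (including the base word itself)
    // points at the same { word, id } pair.
    void add(String id, String word, String... synonyms) {
        for (String syn : synonyms) {
            index.put(syn, new String[] { word, id });
        }
    }

    // Returns { word, id }, or null if the token is unknown.
    String[] lookup(String token) {
        return index.get(token);
    }

    public static void main(String[] args) {
        SynonymLookup lut = new SynonymLookup();
        // Row from the example table: 13991  A  a, A, Aa, aa, AA
        lut.add("13991", "A", "a", "A", "Aa", "aa", "AA");
        String[] hit = lut.lookup("aa");
        System.out.println(hit[0] + " " + hit[1]);   // prints "A 13991"
    }
}
```

Note that the keys here are exact strings, matching the example in which 'a', 'A', 'Aa', 'aa', and 'AA' are listed as distinct synonyms; if case-insensitive lookup were wanted instead, the same downcasing normalization discussed earlier in the thread could be applied to keys and lookup tokens alike.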