Thanks again Ian. I'll make the changes you suggested. I am using the dots because when I searched for 'AA' it was matching 'AA-' as well.
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 12, 2013 at 9:50 PM, Ian Lea <ian....@gmail.com> wrote:

> From a glance it looks fine. I don't see what you gain by adding dots -
> you are using a TermQuery which will only do exact matches. Since you're
> using StringField your text won't be tokenized but stored as is. I see
> you're searching on a mixed-case term - that's fine as long as you don't
> expect "aaa" to match "AAA". I tend to just downcase everything because
> I've wasted so much time over the years on silly case-sensitive bugs.
>
> RAMDirectory instances will disappear when the application ends, so yes,
> you'll need to reload on startup. You don't have to recreate for each
> search though - create and populate the RAMDirectory on startup, create
> an IndexSearcher, and use that for all searches.
>
> Depending on your app it might be easier to use a normal disk-based
> index. It will probably be fast enough.
>
> --
> Ian.
>
>
> On Tue, Feb 12, 2013 at 1:29 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> > Hello Ian,
> >
> > I started as directed by you and created the index. Here is a small
> > piece of code which I have written. Please have a look over it:
> >
> > public static void main(String[] args) throws IOException, ParseException {
> >     // Specify the analyzer for tokenizing text. The same analyzer
> >     // should be used for indexing and searching.
> >     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> >
> >     // 1. create the index
> >     Directory index = new RAMDirectory();
> >     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
> >     IndexWriter w = new IndexWriter(index, config);
> >     Configuration conf = HBaseConfiguration.create();
> >     HTable table = new HTable(conf, "mappings");
> >     Scan s = new Scan();
> >     ResultScanner rs = table.getScanner(s);
> >     int count = 0;
> >     String[] localnames;
> >     for (Result r : rs) {
> >         count++;
> >         localnames = Bytes.toString(r.getValue(Bytes.toBytes("cf"),
> >                 Bytes.toBytes("LOC"))).trim().split(",");
> >         for (String str : localnames) {
> >             addDoc(w, "." + str + ".",
> >                     Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON"))),
> >                     Bytes.toString(r.getRow()));
> >         }
> >     }
> >     System.out.println("COUNT : " + count);
> >     rs.close();    // close the scanner before closing the table
> >     table.close();
> >     w.close();
> >
> >     // 2. query
> >     String term = "";
> >     // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
> >     // System.out.println("Enter the term you want to search...");
> >     // term = br.readLine();
> >     term = "Vacuolated Lymphocytes";
> >     TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));
> >
> >     // 3. search
> >     int hitsPerPage = 10;
> >     IndexReader reader = DirectoryReader.open(index);
> >     IndexSearcher searcher = new IndexSearcher(reader);
> >     TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
> >     searcher.search(tq, collector);
> >     ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >
> >     // 4. display results
> >     System.out.println("Found " + hits.length + " hits.");
> >     for (int i = 0; i < hits.length; ++i) {
> >         int docId = hits[i].doc;
> >         Document d = searcher.doc(docId);
> >         System.out.println("ControlID -> " + d.get("controlid") + "\t"
> >                 + "Localnames -> " + d.get("localname") + "\t"
> >                 + "Controlname -> " + d.get("controlname"));
> >     }
> >     // The reader can only be closed when there is no need to access
> >     // the documents any more.
> >     reader.close();
> > }
> >
> > private static void addDoc(IndexWriter w, String local, String control,
> >         String rowkey) throws IOException {
> >     Document doc = new Document();
> >     doc.add(new StringField("localname", local, Field.Store.YES));
> >     doc.add(new StringField("controlname", control, Field.Store.YES));
> >     doc.add(new StringField("controlid", rowkey, Field.Store.YES));
> >     w.addDocument(doc);
> > }
> >
> > Does it look fine to you? Or can I make it better by adding or removing
> > something? Although it shows just a primitive usage of Lucene, it is
> > always better to have some able guidance.
> >
> > One more question: does the index stay alive only for the lifetime of
> > the application if we are using RAMDirectory? Do I have to run the
> > entire process every time I want to search something?
> >
> > Also, I have added a dot (.) before and after each word before adding
> > it to the document so that I can do an exact-match search. Is my
> > approach correct, or is there some other OOTB feature available in
> > Lucene which I can use for this?
> >
> > I am sorry to pester you with questions, and thank you so much for
> > your time.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> > cloudfront.blogspot.com
> >
> >
> > On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> >
> >> Hey Ian. Thank you so much for the quick reply. I'll definitely give
> >> Lucene a shot.
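Ian's "downcase everything" advice from the reply above can be sketched as a small normalization helper applied identically at index time and query time, so that StringField plus TermQuery matches exactly but case-insensitively, with no need for dot-padding. The class and method names below are illustrative, not from the thread, and the Lucene wiring is shown only in comments:

```java
import java.util.Locale;

public class ExactMatchNormalizer {

    // One normalization, used on BOTH sides. Trim stray whitespace, then
    // downcase with a fixed locale so the result doesn't depend on the
    // JVM's default locale.
    static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Index side (sketch):
        //   doc.add(new StringField("localname", normalize(str), Field.Store.YES));
        // Query side (sketch):
        //   new TermQuery(new Term("localname", normalize(term)));
        System.out.println(normalize("  Vacuolated Lymphocytes "));   // prints "vacuolated lymphocytes"
        System.out.println(normalize("VACUOLATED LYMPHOCYTES")
                .equals(normalize("Vacuolated Lymphocytes")));        // prints "true"
    }
}
```

Because the same function runs over stored values and query terms, "aaa" and "AAA" land on the same indexed term, which is exactly the class of bug Ian describes.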
> >> I'll start off with it and get back to you in case of any problem.
> >>
> >> Many thanks.
> >>
> >> Warm Regards,
> >> Tariq
> >> https://mtariq.jux.com/
> >> cloudfront.blogspot.com
> >>
> >>
> >> On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <ian....@gmail.com> wrote:
> >>
> >>> You can certainly use Lucene for this, and it will be blindingly fast
> >>> even if you use a disk-based index.
> >>>
> >>> Just index documents as you've laid it out, with the field you want
> >>> to search on added as indexable and the others stored.
> >>>
> >>> I've never used Guava's Table so can't comment on that, but with only
> >>> a few thousand words it would certainly be feasible to use something
> >>> like that. Better? I don't know.
> >>>
> >>> Personally I'd probably go with Lucene as I'd be positive it would a)
> >>> work and b) be fast even if the thousands ended up being tens of
> >>> thousands, or more.
> >>>
> >>> --
> >>> Ian.
> >>>
> >>> On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> >>> > Hello list,
> >>> >
> >>> > I have a scenario wherein I need an in-memory index for faster
> >>> > search. The problem goes like this:
> >>> >
> >>> > I have a list which contains a couple of thousand words. Each word
> >>> > has a corresponding ID and a list of synonyms. The actual word is a
> >>> > column in my HBase table. I get files which contain values for this
> >>> > column, and I have to extract the values from these files and put
> >>> > them into the appropriate column. But sometimes the files may
> >>> > contain a synonym instead of the actual word. This is where the
> >>> > index comes into the picture: I need an index that contains all the
> >>> > words along with their IDs and all the synonyms, and it should
> >>> > always be in-memory so that inserts into HBase are quick.
> >>> > Something like this:
> >>> >
> >>> > ID     WORD   SYNONYMS
> >>> > 13991  A      a, A, Aa, aa, AA
> >>> >
> >>> > Then the index should be something like this:
> >>> >
> >>> > a    A   13991
> >>> > A    A   13991
> >>> > Aa   A   13991
> >>> > aa   A   13991
> >>> > AA   A   13991
> >>> >
> >>> > So that if I get 'a' in a file, I should be able to do a lookup and
> >>> > the index should give me 'A' along with '13991'. I need both the
> >>> > base name and the ID. The names could even be strings of 4 to 5
> >>> > words.
> >>> >
> >>> > I have all this information stored in an HBase table with two
> >>> > columns, where the first column contains the actual word and the
> >>> > second column contains the entire list of synonyms. The rowkey is
> >>> > the ID.
> >>> >
> >>> > Now, I am not sure whether it is feasible to use Lucene for this,
> >>> > or whether I should go with something like Guava's Table or
> >>> > something else. I need some guidance: being new to Lucene, I am not
> >>> > able to think in the right direction. If it is feasible to use
> >>> > Lucene to achieve this, how do I do it efficiently?
> >>> >
> >>> > I am using HBase filters right now to do the fetch, which is
> >>> > slowing down the process.
> >>> >
> >>> > I am sorry if my questions sound too childish or senseless, as I am
> >>> > not very good at Lucene. Thank you so much for your valuable time.
> >>> >
> >>> > Warm Regards,
> >>> > Tariq
> >>> > https://mtariq.jux.com/
> >>> > cloudfront.blogspot.com
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
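The non-Lucene alternative discussed in the thread (the "Guava Table or something else" option) can be sketched as a plain in-memory map from each synonym to its base word and ID: for a couple of thousand rows this fits easily in memory and gives O(1) lookups. The class and method names below are made up for illustration; the sample row is the one from the example table above:

```java
import java.util.HashMap;
import java.util.Map;

public class SynonymLookup {

    // synonym -> { base word, id }
    private final Map<String, String[]> index = new HashMap<>();

    // Register one row: every synonym (including the base word itself)
    // points at the same { word, id } pair.
    void add(String id, String word, String... synonyms) {
        for (String syn : synonyms) {
            index.put(syn, new String[] { word, id });
        }
    }

    // Returns { word, id }, or null if the token is unknown.
    String[] lookup(String token) {
        return index.get(token);
    }

    public static void main(String[] args) {
        SynonymLookup lut = new SynonymLookup();
        // Row from the example table: 13991  A  a, A, Aa, aa, AA
        lut.add("13991", "A", "a", "A", "Aa", "aa", "AA");
        String[] hit = lut.lookup("aa");
        System.out.println(hit[0] + " " + hit[1]);   // prints "A 13991"
    }
}
```

Note that the keys here are exact strings, matching the example in which 'a', 'A', 'Aa', 'aa', and 'AA' are listed as distinct synonyms; if case-insensitive lookup were wanted instead, the same downcasing normalization discussed earlier in the thread could be applied to keys and lookup tokens alike.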