Re: Optimal way to index

Mohammad Tariq Tue, 12 Feb 2013 12:19:58 -0800

It actually did. I'll cross check once more and make sure I was doing it
correctly.


Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Feb 13, 2013 at 1:44 AM, Ian Lea <[email protected]> wrote:

> "AA-" indexed as a StringField was matched by a TermQuery for "AA"?
> Sounds surprising.
>
>
> --
> Ian.
>
>
> On Tue, Feb 12, 2013 at 6:32 PM, Mohammad Tariq <[email protected]>
> wrote:
> > Thanks again Ian. I'll make the changes you suggested. I added the
> > dots because a search for 'AA' was also returning 'AA-'.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> > cloudfront.blogspot.com
> >
> >
> > On Tue, Feb 12, 2013 at 9:50 PM, Ian Lea <[email protected]> wrote:
> >
> >> From a glance it looks fine.  I don't see what you gain by adding dots
> >> - you are using a TermQuery which will only do exact matches.  Since
> >> you're using StringField your text won't be tokenized but stored as
> >> is.  I see you're searching on a mixed case term - that's fine as long
> >> as you don't expect "aaa" to match "AAA".  I tend to just downcase
> >> everything because I've wasted so much time over the years on silly
> >> case sensitive bugs.
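
The exact-match behaviour described above can be checked with a small standalone sketch. It assumes the Lucene 4.0 API used elsewhere in this thread (Version.LUCENE_40, RAMDirectory); the class name ExactMatchDemo and the three sample values are made up for illustration and are not from the original code.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical demo class; assumes Lucene 4.0 on the classpath.
public class ExactMatchDemo {

    // Indexes three sample names as untokenized StringFields and returns
    // how many documents a TermQuery for the given term matches.
    public static int countHits(String term) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter w = new IndexWriter(dir, config);
        for (String name : new String[] {"AA", "AA-", "aa"}) {
            Document doc = new Document();
            // StringField is indexed verbatim as a single term.
            doc.add(new StringField("localname", name, Field.Store.YES));
            w.addDocument(doc);
        }
        w.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs top = searcher.search(
                    new TermQuery(new Term("localname", term)), 10);
            return top.totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("AA  -> " + countHits("AA"));  // matches only "AA"
        System.out.println("AA- -> " + countHits("AA-")); // matches only "AA-"
        System.out.println("Aa  -> " + countHits("Aa"));  // no match: case differs
    }
}
```

If a case-insensitive lookup is wanted, one option is to lowercase the value both when indexing and when building the Term, so that the two sides always agree.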
> >>
> >> RAMDirectory instances will disappear when the application ends so
> >> yes, you'll need to reload on startup.  You don't have to recreate for
> >> each search though - create and populate the RAMDirectory on startup
> >> and create an IndexSearcher and use that for all searches.
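
A rough sketch of the build-once, search-many pattern described above, again assuming the Lucene 4.0 API from this thread. The class name LookupIndex, the row layout (localname, controlname, controlid) and the lowercasing step are assumptions for illustration; the field names follow the code posted earlier in the thread.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Locale;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical holder: populate the RAMDirectory once at startup,
// then reuse the same IndexSearcher for every lookup.
public class LookupIndex implements Closeable {
    private final DirectoryReader reader;
    private final IndexSearcher searcher;

    // Each row is {localname, controlname, controlid} (assumed layout).
    public LookupIndex(String[][] rows) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
        for (String[] row : rows) {
            Document doc = new Document();
            // Searchable key, lowercased so lookups ignore case.
            doc.add(new StringField("localname",
                    row[0].toLowerCase(Locale.ROOT), Field.Store.YES));
            // Payload fields are only stored, not indexed.
            doc.add(new StoredField("controlname", row[1]));
            doc.add(new StoredField("controlid", row[2]));
            w.addDocument(doc);
        }
        w.close();
        reader = DirectoryReader.open(dir);
        searcher = new IndexSearcher(reader);
    }

    // Returns {controlname, controlid}, or null when nothing matches.
    public String[] lookup(String name) throws IOException {
        TermQuery q = new TermQuery(
                new Term("localname", name.toLowerCase(Locale.ROOT)));
        TopDocs top = searcher.search(q, 1);
        if (top.totalHits == 0) {
            return null;
        }
        Document d = searcher.doc(top.scoreDocs[0].doc);
        return new String[] {d.get("controlname"), d.get("controlid")};
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

The application would create one LookupIndex at startup, call lookup() for each incoming value, and close it on shutdown. Swapping RAMDirectory for a disk-based Directory should not require changes to the lookup side.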
> >>
> >> Depending on your app it might be easier to use a normal disk based
> >> index.  It will probably be fast enough.
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Tue, Feb 12, 2013 at 1:29 PM, Mohammad Tariq <[email protected]>
> >> wrote:
> >> > Hello Ian,
> >> > I started as directed by you and created the index. Here is a small
> >> > piece of code which I have written. Please have a look at it:
> >> >
> >> > public static void main(String[] args) throws IOException, ParseException {
> >> >
> >> >     // Specify the analyzer for tokenizing text. The same analyzer should
> >> >     // be used for indexing and searching.
> >> >     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> >> >
> >> >     // 1. create the index
> >> >     Directory index = new RAMDirectory();
> >> >     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
> >> >     IndexWriter w = new IndexWriter(index, config);
> >> >     Configuration conf = HBaseConfiguration.create();
> >> >     HTable table = new HTable(conf, "mappings");
> >> >     Scan s = new Scan();
> >> >     ResultScanner rs = table.getScanner(s);
> >> >     int count = 0;
> >> >     String[] localnames;
> >> >     for (Result r : rs) {
> >> >         count++;
> >> >         localnames = Bytes.toString(r.getValue(Bytes.toBytes("cf"),
> >> >                 Bytes.toBytes("LOC"))).trim().split(",");
> >> >         for (String str : localnames) {
> >> >             // trim each name: the list may have spaces after the commas
> >> >             addDoc(w, "." + str.trim() + ".",
> >> >                     Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON"))),
> >> >                     Bytes.toString(r.getRow()));
> >> >         }
> >> >     }
> >> >     System.out.println("COUNT : " + count);
> >> >     rs.close();
> >> >     table.close();
> >> >     w.close();
> >> >
> >> >     // 2. query
> >> >     String term = "";
> >> >     // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
> >> >     // System.out.println("Enter the term you want to search...");
> >> >     // term = br.readLine();
> >> >     term = "Vacuolated Lymphocytes";
> >> >     TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));
> >> >
> >> >     // 3. search
> >> >     int hitsPerPage = 10;
> >> >     IndexReader reader = DirectoryReader.open(index);
> >> >     IndexSearcher searcher = new IndexSearcher(reader);
> >> >     TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
> >> >     searcher.search(tq, collector);
> >> >     ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >> >
> >> >     // 4. display results
> >> >     System.out.println("Found " + hits.length + " hits.");
> >> >     for (int i = 0; i < hits.length; ++i) {
> >> >         int docId = hits[i].doc;
> >> >         Document d = searcher.doc(docId);
> >> >         System.out.println("ControlID -> " + d.get("controlid") + "\t"
> >> >                 + "Localnames -> " + d.get("localname") + "\t"
> >> >                 + "Controlname -> " + d.get("controlname"));
> >> >     }
> >> >     // reader can only be closed when there
> >> >     // is no need to access the documents any more.
> >> >     reader.close();
> >> > }
> >> >
> >> > private static void addDoc(IndexWriter w, String local, String control,
> >> >         String rowkey) throws IOException {
> >> >     Document doc = new Document();
> >> >     doc.add(new StringField("localname", local, Field.Store.YES));
> >> >     doc.add(new StringField("controlname", control, Field.Store.YES));
> >> >     doc.add(new StringField("controlid", rowkey, Field.Store.YES));
> >> >     w.addDocument(doc);
> >> > }
> >> >
> >> > Does it look fine to you? Or can I make it better by adding or removing
> >> > something? Although it shows just a primitive usage of Lucene, it is
> >> > always better to have some able guidance.
> >> >
> >> > One more question. Does the index stay alive only for the lifetime of
> >> > the application if we are using RAMDirectory? Right now I have to run
> >> > the entire process every time I want to search for something.
> >> >
> >> > Also, I have added a dot (.) before and after each word before adding
> >> > it to the document so that I can do an exact match search. Is my
> >> > approach correct, or is there an OOTB feature in Lucene which I can
> >> > use for this?
> >> >
> >> > I am sorry to pester you with questions, and thank you so much for your
> >> > time.
> >> >
> >> > Warm Regards,
> >> > Tariq
> >> > https://mtariq.jux.com/
> >> > cloudfront.blogspot.com
> >> >
> >> >
> >> > On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <[email protected]> wrote:
> >> >
> >> >> Hey Ian. Thank you so much for the quick reply. I'll definitely give
> >> >> Lucene a shot. I'll start off with it and get back to you in case of
> >> >> any problem.
> >> >>
> >> >> Many thanks.
> >> >>
> >> >> Warm Regards,
> >> >> Tariq
> >> >> https://mtariq.jux.com/
> >> >> cloudfront.blogspot.com
> >> >>
> >> >>
> >> >> On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <[email protected]> wrote:
> >> >>
> >> >>> You can certainly use Lucene for this, and it will be blindingly fast
> >> >>> even if you use a disk-based index.
> >> >>>
> >> >>> Just index documents as you've laid it out, with the field you want to
> >> >>> search on added as indexable and the others stored.
> >> >>>
> >> >>> I've never used Guava Table so can't comment on that, but with only a
> >> >>> few thousand words it would certainly be feasible to use something
> >> >>> like that.  Better?  I don't know.
> >> >>>
> >> >>> Personally I'd probably go with Lucene as I'd be positive it would a)
> >> >>> work and b) be fast even if the thousands end up being tens of
> >> >>> thousands, or more.
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Ian.
> >> >>>
> >> >>> On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <[email protected]> wrote:
> >> >>> > Hello list,
> >> >>> >
> >> >>> > I have a scenario wherein I need an in-memory index because I need
> >> >>> > faster search. The problem goes like this:
> >> >>> >
> >> >>> > I have a list which contains a couple of thousand words. Each word
> >> >>> > has a corresponding ID and a list of synonyms. The actual word is a
> >> >>> > column in my HBase table. I get files which contain values for this
> >> >>> > column, and I have to extract the values from these files and put
> >> >>> > them into the appropriate column. But sometimes the files contain a
> >> >>> > synonym instead of the actual word. This is where the index comes
> >> >>> > into the picture. I should have an index that contains all the words
> >> >>> > along with their IDs and all the synonyms, and it should always be in
> >> >>> > memory so that inserts into HBase are quick. Something like this:
> >> >>> >
> >> >>> >  ID          WORD           SYNONYMS
> >> >>> >  13991     A                  a, A, Aa, aa, AA
> >> >>> >
> >> >>> > Then the index should be something like this :
> >> >>> > a    A   13991
> >> >>> > A    A   13991
> >> >>> > Aa  A   13991
> >> >>> > aa   A   13991
> >> >>> > AA  A   13991
> >> >>> >
> >> >>> > So that if I get 'a' in the file, I should be able to do a lookup and
> >> >>> > the index should give me 'A' along with '13991'. I need both the base
> >> >>> > name and the ID. The names could even be strings of 4 to 5 words.
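
For comparison with the Lucene route, the lookup table laid out above can also be sketched with a plain java.util.HashMap. This is only an illustration under stated assumptions: the class name SynonymMap, the Entry type and the comma-separated synonym format are not from the original post, and the map does exact, case-sensitive matching only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory lookup: every synonym maps to (word, id).
public class SynonymMap {

    // The base word and its ID, shared by all synonyms of that word.
    public static final class Entry {
        public final String word;
        public final String id;

        Entry(String word, String id) {
            this.word = word;
            this.id = id;
        }
    }

    private final Map<String, Entry> bySynonym = new HashMap<>();

    // Registers one row: the ID, the base word, and a comma-separated
    // synonym list such as "a, A, Aa, aa, AA".
    public void addRow(String id, String word, String synonymList) {
        Entry entry = new Entry(word, id);
        bySynonym.put(word, entry);
        for (String raw : synonymList.split(",")) {
            String key = raw.trim(); // drop spaces around each synonym
            if (!key.isEmpty()) {
                bySynonym.put(key, entry);
            }
        }
    }

    // Exact lookup; returns null when the name is unknown.
    public Entry lookup(String name) {
        return bySynonym.get(name.trim());
    }

    public static void main(String[] args) {
        SynonymMap map = new SynonymMap();
        map.addRow("13991", "A", "a, A, Aa, aa, AA");
        Entry e = map.lookup("aa");
        System.out.println(e.word + " " + e.id); // prints "A 13991"
    }
}
```

A map like this has to be rebuilt from HBase on every startup, just like a RAMDirectory, and it offers nothing beyond exact key lookup. That may be enough for a few thousand names.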
> >> >>> >
> >> >>> > I have all this information stored in an HBase table with two
> >> >>> > columns, where the first column contains the actual word and the
> >> >>> > second column contains the entire list of synonyms. The rowkey is
> >> >>> > the ID.
> >> >>> >
> >> >>> > Now, I am not sure whether it is feasible to use Lucene for this, or
> >> >>> > whether I should go with something like 'Guava Table' or something
> >> >>> > else. I need some guidance because, being new to Lucene, I am not
> >> >>> > able to think in the right direction. If it is feasible to use Lucene
> >> >>> > to achieve this, how do I do it efficiently?
> >> >>> >
> >> >>> > I am using HBase filters right now to do the fetch, which is slowing
> >> >>> > down the process.
> >> >>> >
> >> >>> > I am sorry if my questions sound too childish or senseless, as I am
> >> >>> > not very good at Lucene. Thank you so much for your valuable time.
> >> >>> >
> >> >>> > Warm Regards,
> >> >>> > Tariq
> >> >>> > https://mtariq.jux.com/
> >> >>> > cloudfront.blogspot.com
> >> >>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: [email protected]
> >> >>> For additional commands, e-mail: [email protected]
> >> >>>
> >> >>>
> >> >>
> >>
> >>
> >>
>
>
>
