Re: conditional High Freq Terms in Lucene index

Michael McCandless Sat, 31 Mar 2012 03:18:47 -0700

One big problem is that your collector (the one that gathers all "A" doc
IDs) is not mapping each per-segment docID into the top-level (global)
docID space.

You need to save the docBase that was passed to setNextReader, and
then add it to the doc passed in on each collect call.
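
A minimal sketch of that remapping, in plain Java so that it runs without Lucene on the classpath. The class name DocBaseCollectorSketch and the hits() accessor are hypothetical; the two methods only mirror the shape of the Collector callbacks in the quoted code (setNextReader and collect), so in real code they would be the overrides on your Collector, with the IndexReader argument kept. Storing the global IDs in a java.util.BitSet instead of a list of strings is an assumption on my side, chosen because it makes the later "is this doc type A?" check a single lookup.

```java
import java.util.BitSet;

/** Plain-Java stand-in for the collector; only the docBase handling is the point. */
public class DocBaseCollectorSketch {
    // Top-level (global) docIDs of every matching document.
    private final BitSet globalHits = new BitSet();
    // Offset of the current segment within the global docID space.
    private int docBase;

    /** Mirrors setNextReader(IndexReader reader, int docBase): remember the offset. */
    public void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    /** Mirrors collect(int doc): doc is relative to the current segment. */
    public void collect(int doc) {
        globalHits.set(docBase + doc);
    }

    /** Global docIDs collected so far. */
    public BitSet hits() {
        return globalHits;
    }

    public static void main(String[] args) {
        DocBaseCollectorSketch c = new DocBaseCollectorSketch();
        c.setNextReader(0);    // first segment starts at global doc 0
        c.collect(3);
        c.setNextReader(100);  // second segment starts at global doc 100
        c.collect(3);          // same per-segment ID, but a different document
        System.out.println(c.hits()); // prints {3, 103}
    }
}
```

Without the docBase offset, both collect(3) calls above would record the same ID, so documents from later segments would be confused with documents from the first one. With the global IDs in a BitSet, the postings walk can test hits().get(docID) for each posting and keep one running count per term.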

Mike McCandless

http://blog.mikemccandless.com

On Fri, Mar 30, 2012 at 7:23 PM, starz10de <[email protected]> wrote:
> Thanks for your hint.
>
> I tried a simple solution, as follows.
> First, I find the documents of type “A” by searching the document type
> field in the index, and store their IDs in an array:
> public static void doStreamingSearch(final Searcher searcher, Query query)
>                        throws IOException {
>
>
>                Collector streamingHitCollector = new Collector() {
>                        // simply print docId and score of every matching document
>                        @Override
>                        public void collect(int doc) throws IOException {
>                                c++;
>                        //      System.out.println("doc=" + doc);
>
>                                doc_id.add(doc+"");
>                                //  System.out.println("doc=" + doc  );
>                                // scorer.score());
>                        }
>
>                        @Override
>                        public boolean acceptsDocsOutOfOrder() {
>                                return true;
>                        }
>
>                        @Override
>                        public void setNextReader(IndexReader arg0, int arg1)
>                                        throws IOException {
>                                // TODO Auto-generated method stub
>
>                        }
>
>                        @Override
>                        public void setScorer(Scorer arg0) throws IOException {
>                                // TODO Auto-generated method stub
>
>                        }
>
>                };
>
>                 searcher.search(query, streamingHitCollector);
>
>        }
> Then I modified HighFreqTerms in Lucene as follows:
> while (terms.next()) {
>     dok.seek(terms);
>     while (dok.next()) {
>         // keep only postings whose docID is in the collected "A" list
>         for (int i = 0; i < doc_id.size(); ++i) {
>             if (doc_id.get(i).equals(dok.doc() + "")) {
>                 if (terms.term().field().equals(field)) {
>                     tiq.insertWithOverflow(new TermInfo(terms.term(), dok.freq()));
>                 }
>             }
>         }
>     }
> }
> I could verify that I correctly get only the documents of type “A”. However,
> the result is not correct, because a few terms appear twice in the ordered
> list of high-frequency terms.
>
> Any hints as to where the problem is?
>
> Michael McCandless-2 wrote
>>
>> You'd have to modify the HighFreqTerms sources...
>>
>> Roughly...
>>
>> First, make a bitset recording which docs are type A (eg, use
>> FieldCache), second, change HighFreqTerms so that for each term, it
>> walks the postings, counting how many type A docs there were, then...
>> just use the rest of HighFreqTerms (priority queue, etc.).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Mar 29, 2012 at 11:33 AM, starz10de <farag_ahmed@> wrote:
>>> Hi,
>>>
>>> I am using the HighFreqTerms class to compute the most frequent terms in
>>> the Lucene index, and it works well. However, I would like to compute them
>>> under a condition: not for all documents in the index, but only for
>>> documents of type “A”. Besides the “contents” field, the index also has a
>>> “DocType” (document type) field.
>>> So I want to compute the most frequent terms only if DocType=“A”.
>>>
>>> Any idea how to do this?
>>>
>>> Thanks
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3868066.html
>>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3872298.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
