One big problem is that your collector (the one that gathers all "A" doc IDs) is not mapping each per-segment docID into the top-level global docID space.
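A minimal sketch of that mapping, kept independent of Lucene so it runs on its own. The `SegmentCollector` interface and the class names are hypothetical stand-ins that only mirror the two callbacks discussed in this thread (`setNextReader` and `collect`); they are not Lucene's actual `Collector` class, whose `setNextReader` also receives the segment's `IndexReader`. The idea is simply to remember the base offset announced for each segment and add it to every collected ID:

```java
import java.util.BitSet;

public class DocBaseSketch {

    /** Hypothetical stand-in for the two collector callbacks discussed here. */
    interface SegmentCollector {
        void setNextReader(int docBase); // called once when a new segment starts
        void collect(int doc);           // doc is relative to the current segment
    }

    /** Records hits as global docIDs by adding the current segment's docBase. */
    static final class GlobalDocIdCollector implements SegmentCollector {
        private final BitSet hits = new BitSet();
        private int docBase; // offset of the current segment in the global docID space

        @Override
        public void setNextReader(int docBase) {
            this.docBase = docBase; // save it for the collect calls that follow
        }

        @Override
        public void collect(int doc) {
            hits.set(docBase + doc); // per-segment ID -> global ID
        }

        BitSet hits() {
            return hits;
        }
    }

    public static void main(String[] args) {
        GlobalDocIdCollector c = new GlobalDocIdCollector();
        // Assumed layout: first segment holds 3 docs (base 0), second starts at base 3.
        c.setNextReader(0);
        c.collect(0);
        c.collect(2);
        c.setNextReader(3);
        c.collect(0); // global doc 3
        c.collect(1); // global doc 4
        System.out.println(c.hits()); // prints {0, 2, 3, 4}
    }
}
```

Without the offset, the two `collect(0)` calls would both record doc 0, so hits from different segments collide. Keeping the IDs in a bitset rather than a list of strings also turns the later membership check into a single `get(docId)` call.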
You need to save the docBase that was passed to setNextReader, and
then add it back in on each collect call.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Mar 30, 2012 at 7:23 PM, starz10de <[email protected]> wrote:
> Thanks for your hint.
>
> I tried simple solution as following:
> Firstly I determine the document type “A” and stored them in an array by
> searching the field document type in the index:
> public static void doStreamingSearch(final Searcher searcher, Query query)
>         throws IOException {
>
>     Collector streamingHitCollector = new Collector() {
>         // simply print docId and score of every matching document
>         @Override
>         public void collect(int doc) throws IOException {
>             c++;
>             // System.out.println("doc=" + doc);
>
>             doc_id.add(doc+"");
>             // System.out.println("doc=" + doc );
>             // scorer.score());
>         }
>
>         @Override
>         public boolean acceptsDocsOutOfOrder() {
>             return true;
>         }
>
>         @Override
>         public void setNextReader(IndexReader arg0, int arg1)
>                 throws IOException {
>             // TODO Auto-generated method stub
>
>         }
>
>         @Override
>         public void setScorer(Scorer arg0) throws IOException {
>             // TODO Auto-generated method stub
>
>         }
>
>     };
>
>     searcher.search(query, streamingHitCollector);
>
> }
> Then I modified the HighFrequentTerm in lucene as follows:
> while (terms.next()) {
>
>     dok.seek(terms);
>
>     while (dok.next()) {
>
>         for(int i=0;i< doc_id.size();++i)
>         {
>
>             if( doc_id.get(i).equals(dok.doc()+""))
>             {
>                 if (terms.term().field().equals(field) ) {
>
>                     tiq.insertWithOverflow(new TermInfo(terms.term(), dok.freq()));
>                 }
>
>             }
> I could test that i correctly have only the document type „A“. However, the
> result is not correct because I can see few terms twice in the ordered high
> frequent list.
>
> Any hints where are the problem?
>
> Michael McCandless-2 wrote
>>
>> You'd have to modify HighFreqTerm's sources...
>>
>> Roughly...
>>
>> First, make a bitset recording which docs are type A (eg, use
>> FieldCache), second, change HighFreqTerms so that for each term, it
>> walks the postings, counting how many type A docs there were, then...
>> just use the rest of HighFreqTerms (priority queue, etc.).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Mar 29, 2012 at 11:33 AM, starz10de <farag_ahmed@> wrote:
>>> HI,
>>>
>>> I am using HighFreqTerms class to compute the high frequent terms in the
>>> Lucene index and it works well. However, I am interested to compute the
>>> high
>>> frequent terms under some condition. I would like to compute the high
>>> frequent terms not for all documents in the index instead only for
>>> documents
>>> with type “A”. Beside the “contents” field in the index I have also the
>>> “DocType” (document type) in the index as extra field.
>>> So I should compute the high frequent term only (if DocType=”A”)
>>>
>>> Any idea how to do this?
>>>
>>> Thanks
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3868066.html
>>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3872298.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
