I had similar requirements of "count" and "group by" on over 130mil records, it's really a pain. It's currently usable but not satisfactory.
Currently it's grouping at run-time by iterating through ungrouped items. It collects matching documents into BitSet, so subsequent queries can use BitSet to retrieve the results of original query. Moreover, it can mark off documents that are already being grouped from the BitSet. In a page that shows 10 records/page, it will only group 10 records at a time. Consequently, there is no way to know the total number grouped records in the beginning. In addition, it feels like reading the field values from the document in order to look for group-by results is most time consuming. How does RDBMS do it? ray, On 8/31/05, kapilChhabra (sent by Nabble.com) <[EMAIL PROTECTED]> wrote: > > thanks a lot for your suggestion. > I'll try it and get back if need be. > > Meanwhile, I gave it a thought and concluded that the best time to do the > categorization/clustering should be lucene calculates Hits/in the Scrorer. > I am not sure if I am right. > In addition to the current functionality can we modify the Scorer class add > the following feature: > The class generates a 2 dimentional array for the clustered field, the first > dimention contains the distinct values of the field and the second dimention > contains the count of results under this field. This value is incremented for > an acceptible hit. > Does it make sense? > If it is possible, i'll dig deeper into the code of the Hits/Scorer classes. > > Thanks in advance, > kapilChhabra > > > -- > Sent from the Lucene - Java Users forum at Nabble.com: > http://www.nabble.com/Search-Results-Clustering-t249355.html#a748901 > >