[ https://issues.apache.org/jira/browse/SOLR-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-1931:
---------------------------------

    Attachment: SOLR-1931-trunk.patch
                SOLR-1931-3x.patch

Well, there are a couple of issues here. I've attached patches for trunk and 3x 
for consideration.

I fixed a structural flaw that traversed all the terms in all the fields twice: 
once to get the total number of terms across all the fields, and once to get 
the individual per-field counts.
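
A single pass is enough to gather both. Roughly, the idea is something like the 
sketch below (illustrative names only, plain Lucene 3.x TermEnum, not the 
actual patch code):
  // One pass over all terms, accumulating per-field term counts and the
  // grand total at the same time. "reader" here is the IndexReader.
  Map<String, Integer> termsPerField = new HashMap<String, Integer>();
  int totalTerms = 0;
  TermEnum termEnum = reader.terms();
  try {
    while (termEnum.next()) {
      String field = termEnum.term().field();
      Integer count = termsPerField.get(field);
      termsPerField.put(field, count == null ? 1 : count + 1);
      totalTerms++;
    }
  } finally {
    termEnum.close();
  }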

But that's not where the bulk of the time gets spent. It turns out that getting 
the count of documents in which each term appears is the culprit. These two 
lines are executed for each field:
  Query q = new TermRangeQuery(fieldName, null, null, false, false); // matches all terms in the field
  TopDocs top = searcher.search(q, 1); // totalHits = live docs containing the field

and top.totalHits is reported. I have an index with 99M documents, mostly 
integer data, that takes 360 seconds to return data when the above is executed 
and 150 seconds without. Both versions traverse all the terms once, so these 
times would be even greater without the patch because of the second traversal.

So the attached patches default to NOT doing the above, and there's a new 
parameter, reportDocCount, that can be set to true to collect that information. 
What do people think? And is there a better way to get the count of documents 
in which the term appears? And do any alternative methods respect deleted docs 
the way this one does?

I tried spinning through the terms with TermDocs (on 3.6) but soon realized 
that the people who wrote TermRangeQuery probably got there first.
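
Roughly, that approach looks something like the sketch below (illustrative 
only, not what's in the patch); the OpenBitSet keeps docs that contain several 
terms from being counted more than once, and TermDocs already skips deleted 
docs:
  // Count the docs containing any term in fieldName via TermEnum/TermDocs.
  OpenBitSet matched = new OpenBitSet(reader.maxDoc());
  TermEnum termEnum = reader.terms(new Term(fieldName, ""));
  TermDocs termDocs = reader.termDocs();
  try {
    do {
      Term t = termEnum.term();
      if (t == null || !t.field().equals(fieldName)) break;
      termDocs.seek(termEnum);
      while (termDocs.next()) {
        matched.set(termDocs.doc());
      }
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }
  long docCount = matched.cardinality();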

So I guess my real question is whether people object to the change in behavior, 
namely that users must explicitly request doc counts. That also means the 
admin/schema browser doesn't report this by default, and I haven't made it 
optional from that interface. I'm not inclined to, since that interface is 
going away, but if people feel strongly I might be persuaded. That info is 
available from admin/luke?fl=myfield&reportDocCount=true in a less painful 
fashion for a particular field anyway.
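
For example, something like this (default example port, hypothetical field 
name):
  curl "http://localhost:8983/solr/admin/luke?fl=myfield&reportDocCount=true"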

Along the way I alphabetized the fields without my other kludge of putting 
comparators in other classes. I'll kill that JIRA if this one goes forward.

Note that this still doesn't scale all that well; on my test index it's still a 
5-minute wait. But then I guess this kind of data gathering will take time by 
its nature.

If nobody objects, I'll commit this early next week, after I've had a chance to 
put it down for a while, look at it with fresh eyes, and do some more testing. 
I think there are some inefficiencies in the single pass that I can wring out 
(about 30 seconds is spent just gathering the data in the single term 
enumeration loop).
                
> Schema Browser does not scale with large indexes
> ------------------------------------------------
>
>                 Key: SOLR-1931
>                 URL: https://issues.apache.org/jira/browse/SOLR-1931
>             Project: Solr
>          Issue Type: Improvement
>          Components: web gui
>    Affects Versions: 3.6, 4.0
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: SOLR-1931-3x.patch, SOLR-1931-trunk.patch
>
>
> The Schema Browser JSP by default causes the Luke handler to "scan the 
> world". In large indexes this makes the UI useless.
> On an index with 64m documents & 8gb of disk space, the Schema Browser took 6 
> minutes to open and hogged all disk I/O, making Solr useless.
