[
https://issues.apache.org/jira/browse/SOLR-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erick Erickson updated SOLR-1931:
---------------------------------
Attachment: SOLR-1931-trunk.patch
SOLR-1931-3x.patch
Well, there are a couple of issues here. I've attached patches for trunk and 3x
for consideration.
I fixed a structural flaw: the code traversed all the terms in all the fields twice,
once to get the total number of terms across all the fields and once to get the
individual per-field counts.
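For reference, the single-pass idea is roughly this (a minimal sketch against the
Lucene 3.x TermEnum API; the class and variable names are illustrative, not the
patch itself):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class SinglePassTermCounts {
  // Walk the term dictionary once, accumulating both the per-field term counts
  // and the grand total, rather than making one pass for the total and a second
  // pass for the individual fields.
  public static Map<String, Integer> countTerms(IndexReader reader) throws IOException {
    Map<String, Integer> perField = new HashMap<String, Integer>();
    long total = 0;
    TermEnum te = reader.terms();
    try {
      while (te.next()) {
        Term t = te.term();
        Integer c = perField.get(t.field());
        perField.put(t.field(), c == null ? 1 : c + 1);
        total++; // same pass, no second traversal needed for the total
      }
    } finally {
      te.close();
    }
    System.out.println("total terms across all fields: " + total);
    return perField;
  }
}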
But that's not where the bulk of the time gets spent. It turns out that getting
the number of documents in which each field appears is the culprit. These two
lines are executed for each field:
Query q = new TermRangeQuery(fieldName, null, null, false, false);
TopDocs top = searcher.search(q, 1);
and top.totalHits is reported. I have an index with 99M documents, mostly
integer data, that takes 360 seconds to return data when the above is executed
and 150 seconds without it. Both versions traverse all the terms once, so these
times would be even greater without the patch because of the second traversal.
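Spelled out, the expensive part is effectively this per-field loop (sketch;
fieldNames and docCounts are placeholders, not the actual variable names in
LukeRequestHandler):

for (String fieldName : fieldNames) {
  // matches every document that has at least one term in this field
  Query q = new TermRangeQuery(fieldName, null, null, false, false);
  TopDocs top = searcher.search(q, 1); // only totalHits is wanted, but the full query still runs
  docCounts.put(fieldName, top.totalHits);
}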
So the attached patches default to NOT doing the above, and there's a new
parameter, reportDocCount, that can be set to true to collect that information.
What do people think? Is there a better way to get the count of documents in
which a field appears? And do any alternative methods respect deleted docs the
way this one does?
I tried spinning through the terms with TermDocs (on 3.6) but soon realized that
the people who wrote TermRangeQuery probably got there first.
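For the record, the TermDocs variant I tried looked roughly like this (sketch;
again illustrative names, not the patch). It does respect deletions, since
TermDocs skips deleted docs, but it ends up doing essentially the same work as
the TermRangeQuery above: union the doc ids of every term in the field and
count them.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.util.OpenBitSet;

public class FieldDocCount {
  // Count the documents that contain at least one term in fieldName, honoring deletions.
  public static long docCount(IndexReader reader, String fieldName) throws IOException {
    OpenBitSet docs = new OpenBitSet(reader.maxDoc());
    TermEnum te = reader.terms(new Term(fieldName, "")); // positioned at the field's first term
    TermDocs td = reader.termDocs();
    try {
      do {
        Term t = te.term();
        if (t == null || !t.field().equals(fieldName)) break; // walked past the end of this field
        td.seek(te);
        while (td.next()) {       // deleted docs are skipped by TermDocs
          docs.set(td.doc());
        }
      } while (te.next());
    } finally {
      td.close();
      te.close();
    }
    return docs.cardinality();    // live docs with at least one term in the field
  }
}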
So I guess my real question is whether people object to the change in behavior:
users must now explicitly request doc counts. That also means the admin/schema
browser doesn't report this by default, and I haven't made it optional from that
interface. I'm not inclined to, since that interface is going away, but if
people feel strongly I might be persuaded. That info is available via
admin/luke?fl=myfield&reportDocCount=true in a less painful fashion for a
particular field anyway.
Along the way I alphabetized the fields without my other kludge of putting
comparators in other classes. I'll kill that JIRA if this one goes forward.
Note that this still doesn't scale all that well; on my test index it's still a
5-minute wait. But then I guess that this kind of data gathering will take time
by its nature.
If nobody objects, I'll commit this early next week after I've had a chance to
put it down for a while and look at it with fresh eyes and do some more
testing. I think there are some inefficiencies in the single pass that I can
wring out (about 30 seconds is spent just gathering the data in the single term
enumeration loop).
> Schema Browser does not scale with large indexes
> ------------------------------------------------
>
> Key: SOLR-1931
> URL: https://issues.apache.org/jira/browse/SOLR-1931
> Project: Solr
> Issue Type: Improvement
> Components: web gui
> Affects Versions: 3.6, 4.0
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: SOLR-1931-3x.patch, SOLR-1931-trunk.patch
>
>
> The Schema Browser JSP by default causes the Luke handler to "scan the
> world". In large indexes this makes the UI useless.
> On an index with 64m documents & 8gb of disk space, the Schema Browser took 6
> minutes to open and hogged all disk I/O, making Solr useless.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]