Please note that this problem is no longer relevant for us, because we will change our c-field from type string (docValues) to type long (docValues). Faceting on huge numbers of long docValues seems to perform very well, except for https://issues.apache.org/jira/browse/SOLR-5444, which we have now handled.

I would still like to help verify that the string-faceting problem this thread has been about is fixed in 4.5.1, i.e. that it performs better and no longer uses huge amounts of memory. To do that, I would like to deploy 4.5.1 on top of my 12 billion documents indexed with 4.4.0. Can anyone confirm that this ought to work? I tried briefly but ran into problems. When starting, Solr says:

[2013-11-08 17:45:48,829]ERROR [coreLoadExecutor-4-thread-19] [logid: ] - org.apache.solr.common.SolrException.log(SolrException.java:119) -null:org.apache.solr.common.SolrException: Unable to create core: mycoll_shard13_replica1
        at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:934)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:566)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:625)
        at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:256)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:555)
        ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1477)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1589)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:821)
        ... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format: 12, input=MMapIndexInput(path="/usr/lib/solr/data/mycoll_shard13_replica1/data/index/_1k63_Disk_0.dvdm")
        at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readNumericEntry(Lucene45DocValuesProducer.java:207)
        at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readFields(Lucene45DocValuesProducer.java:120)
        at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.<init>(Lucene45DocValuesProducer.java:85)
        at org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.<init>(DiskDocValuesProducer.java:31)
        at org.apache.lucene.codecs.diskdv.DiskDocValuesFormat.fieldsProducer(DiskDocValuesFormat.java:56)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:215)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:300)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:140)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
        at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
        at org.apache.lucene.index.ReadersAndLiveDocs.getReadOnlyClone(ReadersAndLiveDocs.java:217)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:379)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
        at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:41)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1443)
        ... 15 more

Besides that, see my comments inline below.

On 11/14/13 7:54 PM, Joel Bernstein wrote:
Per,

As you are seeing there are different implementations for calculating facets for numeric fields and string fields. The numeric fields I believe are using an int-to-int or long-to-int hashmap to hold the facet counts. This map grows as values are added to it. The String version uses an int array the size of the number of distinct values in the field to hold the facet counts. So if you have a very large number of distinct values in the field, you'll have a very large array.
I do not think this part is a problem.
Also the distinct values themselves are held in memory in the fieldCache for string fields.
Yes, that is probably a problem.

Also note https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png and my comments on it in a mail earlier in this thread.

So, basically, as you are seeing, you'll take up a much larger memory footprint when faceting on a high cardinality string field than on a high cardinality numeric field.
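
To make the contrast above concrete, here is a minimal Java sketch of the two counting strategies as described in this mail. It is not Solr's actual implementation; the class name, method names, and sample data are hypothetical, and a plain HashMap stands in for whatever specialized map Solr uses internally. It only illustrates why an array sized by the number of distinct values costs memory up front, while a map grows with the values actually encountered.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not Solr's real classes.
class FacetCountSketch {

    // String-style counting: one int slot per distinct value (ordinal),
    // allocated up front no matter how many documents match.
    public static int[] countByOrdinal(int[] docOrdinals, int distinctValues) {
        int[] counts = new int[distinctValues];
        for (int ord : docOrdinals) {
            counts[ord]++;
        }
        return counts;
    }

    // Numeric-style counting: a map that only grows with values actually seen.
    public static Map<Long, Integer> countByValue(long[] docValues) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long v : docValues) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Rough size in bytes of the ordinal-indexed count array (ignores the array header).
    public static long ordinalArrayBytes(long distinctValues) {
        return distinctValues * Integer.BYTES;
    }

    public static void main(String[] args) {
        int[] ords = {0, 2, 2, 1, 2};              // per-document ordinals of a string field
        long[] vals = {10L, 30L, 30L, 20L, 30L};   // per-document values of a long field
        System.out.println(Arrays.toString(countByOrdinal(ords, 3))); // [1, 1, 3]
        System.out.println(countByValue(vals).get(30L));              // 3
        System.out.println(ordinalArrayBytes(1_000_000_000L));        // 4000000000
    }
}
```

With a hypothetical one billion distinct values, the count array alone would be about 4 GB per request, before counting the distinct values themselves.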

There are docvalues faceting implementations that will kick-in on a field that has docvalues. You can try setting the on disk flag
I believe I did that for my string field "c_dstr_doc_sto"?
From schema.xml:
<dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" stored="true" required="true" docValues="true"/>
<dynamicField name="*_lng_ind_sto" type="long" indexed="true" stored="true"/>
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
...
<fieldType name="dstring" class="solr.StrField" sortMissingLast="true" docValuesFormat="Disk"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>

Did I miss something?
and this will test memory and performance.

Joel


On Thu, Nov 14, 2013 at 8:13 AM, Per Steffensen wrote:

    If anyone is following this one, just an update. We are not going
    to upgrade to 4.5.1 in order to see if the String facet
    performance problem has been fixed. Instead we have made a few
    hacks around our data so that we can store the c-field
    (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So now we only
    need to struggle with long-facet performance. There is a
    performance issue with facets on longs, though; I will describe
    it in another mailing thread, as I need your input on which
    solution you prefer.

https://issues.apache.org/jira/browse/SOLR-5444


    Regards, Per Steffensen

