[ https://issues.apache.org/jira/browse/SOLR-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659224#comment-14659224 ]

Jonathan Gonzalez edited comment on SOLR-7867 at 8/6/15 11:54 PM:
------------------------------------------------------------------

The problem is related to the docValues attribute: for some reason, reading the 
.dvd file fails after several incremental feeds (at least in my case). I am able 
to reproduce the problem on both SolrCloud and standalone instances; the query 
has to use &group.facet=true and the facet field has to be defined with 
docValues=true.

A short-term fix: disable the docValues attribute (docValues=false).

Field definitions:
{code}
<field name="fieldForGrouping" type="int" indexed="true" stored="false" 
multiValued="false" omitNorms="true" termVectors="false" termPositions="false" 
docValues="false"/>
<field name="fieldForFacet" type="string" indexed="true" stored="true" 
multiValued="true" omitNorms="true" termVectors="false" termPositions="false" 
docValues="true"/>
{code}
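
For clarity, the short-term workaround mentioned above amounts to redefining the 
facet field with docValues switched off (note that changing the docValues 
attribute requires a full reindex):
{code}
<field name="fieldForFacet" type="string" indexed="true" stored="true" 
multiValued="true" omitNorms="true" termVectors="false" termPositions="false" 
docValues="false"/>
{code}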

Query:
The query uses &group.field=<fieldForGrouping>&group.facet=true and a simple 
facet like:
{code}
&facet.field={!key=FacetKey_12345678%20facet.prefix=12345678}fieldForFacet
{code}
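
Putting the pieces together, the full request has roughly this shape (q=*:* and 
the /select handler are placeholders here; the field names are the ones defined 
above; the request is wrapped here for readability):
{code}
/select?q=*:*
  &group=true&group.field=fieldForGrouping&group.facet=true
  &facet=true
  &facet.field={!key=FacetKey_12345678%20facet.prefix=12345678}fieldForFacet
{code}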

The following image shows Solr reading the index file of type .dvd 
(per-document values, .dvd/.dvm: encodes additional scoring factors or other 
per-document information; see 
https://lucene.apache.org/core/5_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50DocValuesFormat.html), 
which is enabled by docValues=true 
(https://cwiki.apache.org/confluence/display/solr/DocValues):
!ErrorReadingDocValues.PNG!

Then, while trying to read the facet.prefix value from this .dvd file, there is 
an attempt to read more than the current buffer size, which causes this 
exception:
!DocValuesException.PNG!

Checking the index integrity, the index seems to be OK, so it is probably 
something in the code that reads the document values for numbers.
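
The report below is from Lucene's CheckIndex tool; a typical invocation looks 
something like this (the jar version matches the index version above, and the 
index path is a placeholder to adjust to the environment):
{code}
java -cp lucene-core-5.1.0.jar org.apache.lucene.index.CheckIndex <path_to>\data\index
{code}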

{code}
Opening index @ .........\\data\\index\\

Segments file=segments_6j numSegments=1 version=5.1.0 
id=9bsp5504j7u6jjf6gxl6zw0oo format= userData={commitTimeMSec=1438876474785}
  1 of 1: name=_dz maxDoc=801607
    version=5.1.0
    id=9bsp5504j7u6jjf6gxl6zw0on
    codec=Lucene50
    compound=false
    numFiles=10
    size (MB)=601.626
    diagnostics = {java.vendor=Oracle Corporation, java.version=1.7.0_67, 
lucene.version=5.1.0, mergeFactor=27, mergeMaxNumSegments=1, os=Windows 8.1, 
os.arch=amd64, os.version=6.3, source=merge, timestamp=1438876452570}
    no deletions
    test: open reader.........OK [took 0.104 sec]
    test: check integrity.....OK [took 0.800 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [103 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [3335590 terms; 188733439 terms/docs pairs; 
139381743 tokens] [took 7.999 sec]
    test: stored fields.......OK [84670070 total field count; avg 105.6 fields 
per doc] [took 11.457 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [28 docvalues fields; 0 BINARY; 15 NUMERIC; 6 
SORTED; 0 SORTED_NUMERIC; 7 SORTED_SET] [took 1.176 sec]

No problems were detected with this index.

Took 21.662 sec total.
{code}




> implicit sharded, facet grouping problem with multivalued string field 
> starting with digits
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7867
>                 URL: https://issues.apache.org/jira/browse/SOLR-7867
>             Project: Solr
>          Issue Type: Bug
>          Components: faceting, SolrCloud
>    Affects Versions: 5.2
>         Environment: 3.13.0-48-generic #80-Ubuntu SMP x86_64 GNU/Linux
> java version "1.7.0_80"
> Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
>            Reporter: Umut Erogul
>              Labels: docValues, facet, group, sharding
>         Attachments: DocValuesException.PNG, ErrorReadingDocValues.PNG
>
>
> related parts @ schema.xml:
> {code}<field name="keyword_ss" type="string" indexed="true" stored="true" 
> docValues="true" multiValued="true"/>
> <field name="author_s" type="string" indexed="true" stored="true" 
> docValues="true"/>{code}
> every document has valid author_s and keyword_ss fields;
> we can make successful facet group queries on a single-node, single-collection 
> solr-4.9.0 server
> {code}
> q: *:* fq: keyword_ss:3m
> facet=true&facet.field=keyword_ss&group=true&group.field=author_s&group.facet=true
> {code}
> when querying a solr-5.2.0 server in an implicitly sharded environment with:
> {code}<!-- router.field -->
> <field name="shard_name" type="string" indexed="true" stored="true" 
> required="true"/>{code}
> with example shard names affinity1, affinity2, affinity3, affinity4,
> the same query with the same documents gets:
> {code}
> ERROR - 2015-08-04 08:15:15.222; [document affinity3 core_node32 
> document_affinity3_replica2] org.apache.solr.common.SolrException; 
> org.apache.solr.common.SolrException: Exception during facet.field: keyword_ss
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:632)
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:617)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:571)
>         at 
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:642)
> ...
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>         at 
> org.apache.lucene.codecs.lucene50.Lucene50DocValuesProducer$CompressedBinaryDocValues$CompressedBinaryTermsEnum.readTerm(Lucene50DocValuesProducer.java:1008)
>         at 
> org.apache.lucene.codecs.lucene50.Lucene50DocValuesProducer$CompressedBinaryDocValues$CompressedBinaryTermsEnum.next(Lucene50DocValuesProducer.java:1026)
>         at 
> org.apache.lucene.search.grouping.term.TermGroupFacetCollector$MV$SegmentResult.nextTerm(TermGroupFacetCollector.java:373)
>         at 
> org.apache.lucene.search.grouping.AbstractGroupFacetCollector.mergeSegmentResults(AbstractGroupFacetCollector.java:91)
>         at 
> org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:541)
>         at 
> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:463)
>         at 
> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:386)
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:626)
>         ... 33 more
> {code}
> all the problematic queries are caused by strings starting with digits 
> ("3m", "8 saniye", "2 broke girls", "1v1y");
> there are some strings for which the query works, like "24", "90+", "45 dakika"
> we do not observe the problem when querying with 
> -keyword_ss:(0-9)*
> updating the problematic documents (a small subset of keyword_ss:(0-9)*) 
> fixes the query, 
> but we cannot find an easy way to identify the problematic documents
> there are around 400m docs, separated across 28 shards; 
> -keyword_ss:(0-9)* matches 97% of documents


