Danilo Tomasoni created SOLR-12731:
--------------------------------------

             Summary: SynonimGraphFilter expands wrong synonims
                 Key: SOLR-12731
                 URL: https://issues.apache.org/jira/browse/SOLR-12731
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: search
    Affects Versions: 7.3.1
         Environment: Ubuntu 16.04.5 LTS
            Reporter: Danilo Tomasoni


Hello all, I have an issue with SynonymGraphFilter expanding the wrong 
synonyms for a phrase at query time.

I have a dictionary with the following lines
{code:java}
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 
3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens 
glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo 
sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
{code}
and two documents
{code:java}
{"body": "8. The method of claim 6 wherein said method inhibits at least one 
5′-nucleotidase chosen from cytosolic 5′-nucleotidase II (cN-II), cytosolic 
5′-nucleotidase IA (cN-IA), cytosolic 5′-nucleotidase IB (cN-IB), cytosolic 
5′-nucleotidase IMA (cN-IIIA), cytosolic 5′-nucleotidase NIB (cN-IIIB), 
ecto-5′-nucleotidase (eN, CD73), cytosolic 5′(3′)-deoxynucleotidase (cdN) and 
mitochondrial 5′(3′)-deoxynucleotidase (mdN)."}
{"body": "Trichomonosis caused by the flagellate protozoan Trichomonas 
vaginalis represents the most prevalent nonviral sexually transmitted disease 
worldwide (WHO-DRHR 2012). In women, the symptoms are cyclic and often worsen 
around the menstruation period. In men, trichomonosis is largely asymptomatic 
and these men are considered to be carriers of T. vaginalis (Petrin et al. 
1998). This infection has been associated with birth outcomes (Klebanoff et al. 
2001), infertility (Grodstein et al. 1993), cervical and prostate cancer 
(Viikki et al. 2000, Sutcliffe et al. 2012) and pelvic inflammatory disease 
(Cherpes et al. 2006). Importantly, T. vaginalis is a co-factor in human 
immunodeficiency virus transmission and acquisition (Sorvillo et al. 2001, Van 
Der Pol et al. 2008). Therefore, it is important to study the host-parasite 
relationship to understand T. vaginalis infection and pathogenesis. 
Colonisation of the mucosa by T. vaginalis is a complex multi-step process that 
involves distinct mechanisms (Alderete et al. 2004). The parasite interacts 
with mucin (Lehker & Sweeney 1999), adheres to vaginal epithelial cells (VECs) 
in a process mediated by adhesion proteins (AP120, AP65, AP51, AP33 and AP23) 
and undergoes dramatic morphological changes from a pyriform to an amoeboid 
form (Engbring & Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et al. 
2005). After adhesion to VECs, the synthesis and gene expression of adhesins 
are increased (Kucknoor et al. 2005). These mechanisms must be tightly 
regulated and iron plays a pivotal role in this regulation. Iron is an 
essential element for all living organisms, from the most primitive to the most 
complex, as a component of haeme, iron-sulphur clusters and a variety of 
proteins. Iron is known to contribute to biological functions such as DNA and 
RNA synthesis, oxygen transport and metabolic reactions. T. vaginalis has 
developed multiple iron uptake systems such as receptors for hololactoferrin, 
haemoglobin (HB), haemin (HM) and haeme binding as well as adhesins to 
erythrocytes and epithelial cells (Moreno-Brito et al. 2005, Ardalan et al. 
2009). Iron plays a crucial role in the pathogenesis of trichomonosis by 
increasing cytoadherence and modulating resistance to complement lyses, 
ligation to the extracellular matrix and the expression of proteases 
(Figueroa-Angulo et al. 2012). In agreement with this role, the symptoms of 
trichomonosis worsen after menstruation. In addition, iron also influences 
nucleotide hydrolysis in T. vaginalis (Tasca et al. 2005, de Jesus et al. 
2006). The extracellular concentrations of ATP and adenosine can markedly 
increase under several conditions such as inflammation and hypoxia as well as 
in the presence of pathogens (Robson et al. 2006, Sansom 2012). In the 
extracellular medium, these nucleotides can act as immunomodulators by 
triggering immunological effects. Extracellular ATP acts as a proinflammatory 
immune-mediator by triggering multiple immunological effects on cell types such 
as neutrophils, macrophages, dendritic cells and lymphocytes (Bours et al. 
2006). In this sense, ATP and adenosine concentrations in the extracellular 
compartment are controlled by ectoenzymes, including those of the nucleoside 
triphosphate diphosphohydrolase (NTPDase) (EC: 3.1.4.1) family, which hydrolyze 
tri and diphosphates and ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses 
monophosphates (Zimmermann 2001). Considering that de novo nucleotide synthesis 
is absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme cascade is 
important as a source of the precursor adenosine for purine synthesis in the 
parasite (Munagala & Wang 2003). Extracellular nucleotide metabolism has been 
characterised in several parasite species such as Toxoplasma gondii, 
Schistosoma mansoni, Leishmania spp, Trypanosoma cruzi, Acanthamoeba, Entamoeba 
histolytica, Giardia lamblia and fungi, Saccharomyces cerevisiae, Cryptococcus 
neoformans, Candida parapsilosis and Candida albicans (Sansom 2012). In T. 
vaginalis , NTPDase and ecto-5’-nucleotidase activities have been characterised 
and they are involved in host-parasite interactions by controlling ATP and 
adenosine levels (Matos et al. 2001, d, de Jesus et al. 2002, Tasca et al. 
2003). Considering that (i) iron plays a crucial role in the pathogenesis of 
trichomonosis, (ii) ATP exerts a proinflammatory effect in inflammation, (iii) 
adenosine is important to T. vaginalis growth and acts as an antiinflammatory 
factor (Frasson et al. 2012) and (iv) ectonucleotidases modulate the nucleotide 
levels at infection sites (such as those observed in trichomonosis), the aim of 
this study was to investigate the effect of iron on the extracellular 
nucleotide hydrolysis and gene expression of T . vaginalis."}
{code}
The "body" field has the type "text_en", configured as follows
{code:java}
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
        />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
{code}
The two dictionary lines are in the file "synonyms.txt".

If I run the following query against a Solr instance configured this way and 
containing those documents
{code:java}
(body:"Cytosolic 5'-nucleotidase II" OR body:"EC 3.1.3.5") 
{code}
both documents are returned.

Surprisingly, if I run the query
{code:java}
(body:"Cytosolic 5'-nucleotidase II") 
{code}
the second one is not returned.

If I set debugQuery=true I see that the second line is expanded
{code:java}
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 
3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens 
glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo 
sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
{code}
instead of the first
{code:java}
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II
{code}
The parsed query (as reported by debugQuery) is
{code:java}
"parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, 
spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 0, true), 
spanNear([body:cytosolic,, body:isoform, body:cra_b], 0, true), 
spanNear([body:cdna, body:flj78196,, body:highli, body:similar, body:to, 
body:homo, body:sapien, body:glucosidase,, body:beta,, body:acid, body:3], 0, 
true), body:cytosol, spanNear([body:gba3,, body:mrna], 0, true), 
spanNear([body:cdna,, body:flj93688,, body:homo, body:sapien, 
body:glucosidase,, body:beta,, body:acid, body:3], 0, true), body:cytosol]), 
body:5, body:nucleotidas, body:ii], 0, true))
{code}
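For anyone who wants to reproduce the debug output, here is a minimal sketch of building such a request URL. The host, port and collection name are assumptions (they are not stated in this report), and `debug_query_url` is only an illustrative helper. The `q` and `debugQuery` parameters are the ones used above.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr on the default port and a hypothetical collection name.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "mycollection"


def debug_query_url(q):
    """Build a /select URL that asks Solr to include debug output."""
    params = {"q": q, "debugQuery": "true"}
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


url = debug_query_url('body:"Cytosolic 5\'-nucleotidase II"')
print(url)
```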
If I remove the second line, no synonym is expanded
{code:java}
    "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas ii\")",
{code}
I think this is related to the word "cytosolic", which appears as a synonym in 
the second line. If I remove "cytosolic" as a synonym from the second line, then 
again no synonym is expanded.
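
To illustrate why "cytosolic" ends up as a standalone synonym, here is a minimal sketch of how the second line splits into entries. It assumes that unescaped commas separate entries and that `\,` is a literal comma, which is consistent with the grouping in the parsed query above. It also assumes the line breaks in the dictionary excerpt come from email wrapping, so the line is joined back with spaces. `split_synonym_line` is a hypothetical helper, not part of Solr.

```python
import re


def split_synonym_line(line):
    """Split on commas not preceded by a backslash, then unescape '\\,'."""
    parts = re.split(r"(?<!\\),", line)
    return [part.replace("\\,", ",").strip() for part in parts]


# Second dictionary line, re-joined (assumption: the wraps are email artifacts).
LINE2 = (
    r"A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid "
    r"3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens "
    r"glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo "
    r"sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA"
)

entries = split_synonym_line(LINE2)
# Entries without a space are single-word synonyms.
single_word = [entry for entry in entries if " " not in entry]

for entry in entries:
    print(repr(entry))
```

The single-word entries here are "A8K9N1" and the three occurrences of "Cytosolic"/"cytosolic", which line up with the `body:a8k9n1` and `body:cytosol` terms in the parsed query.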

Can you tell me why this happens? I expected the first line to be expanded, 
since it contains a multi-word synonym that exactly matches the phrase 
query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

