Re: Getting totalTermFreq and docFreq for terms
The idea of adding a terms.ttf parameter sounds fine to me, and it would be good to get terms.list better integrated into the TermsComponent.

In general, I think it's time for more attention to be paid to the TermsComponent.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 4:12 PM, Shai Erera wrote:

> Hmm .. so if I want to add totalTermFreq to the response, it will break
> the current output format of TermsComponent, which returns only the
> docFreq for each term. What's our BWC policy for such an API, and is there
> a way to handle it?
>
> I can add a new terms.ttf parameter: if you set it to true, the response
> will look different (each term will have both docFreq and totalTermFreq
> elements), and if you don't, you will get the same response as today.
> Is that acceptable?
>
> Somewhat related, but can be handled separately: I noticed that if you
> specify terms.list and multiple terms.fl parameters, you only receive stats
> for the first field (the rest are ignored), but if you don't specify
> terms.list, you get results for all fields. I don't see any reason not to
> support multiple fields with terms.list. What do you think?
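For readers following along, a request that exercises the parameters discussed above could be assembled roughly as below. This is only a sketch: the host, collection and field names are the placeholders used elsewhere in this thread, the /terms handler path is assumed to be configured, and terms.ttf is the parameter proposed in this thread rather than something every release is known to accept. The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a TermsComponent request URL.
class TermsRequestUrl {

    static String build(String collectionUrl, String field, String termList, boolean ttf) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("terms", "true");          // enable the TermsComponent
        params.put("terms.fl", field);        // field to read stats from
        params.put("terms.list", termList);   // explicit, already-indexed terms
        if (ttf) {
            params.put("terms.ttf", "true");  // proposed in this thread (assumption)
        }
        StringBuilder url = new StringBuilder(collectionUrl).append("/terms?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/mycollection", "text", "disguis", true));
    }
}
```

Building the query string from a map keeps the optional flag in one place, so the same helper yields the current request when the flag is left off.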
Re: Getting totalTermFreq and docFreq for terms
Hmm .. so if I want to add totalTermFreq to the response, it will break the current output format of TermsComponent, which returns only the docFreq for each term. What's our BWC policy for such an API, and is there a way to handle it?

I can add a new terms.ttf parameter: if you set it to true, the response will look different (each term will have both docFreq and totalTermFreq elements), and if you don't, you will get the same response as today. Is that acceptable?

Somewhat related, but can be handled separately: I noticed that if you specify terms.list and multiple terms.fl parameters, you only receive stats for the first field (the rest are ignored), but if you don't specify terms.list, you get results for all fields. I don't see any reason not to support multiple fields with terms.list. What do you think?

On Wed, Feb 22, 2017 at 10:08 PM Shai Erera wrote:

> Looks like this could be a very easy addition to TermsComponent? From what
> I read in the code, it uses TermContext to compute/hold the stats, and the
> latter already has docFreq and totalTermFreq (!!). It's just that
> TermsComponent does not output TTF (only computes it...):
>
> for(int i=0; i<terms.length; i++) {
>   if(termContexts[i] != null) {
>     String outTerm = fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
>     int docFreq = termContexts[i].docFreq();
>     termsMap.add(outTerm, docFreq);
>   }
> }
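To make the backward-compatibility idea concrete, here is a minimal, self-contained sketch of a response builder that keeps the current shape unless the flag is set. It does not use Solr or Lucene classes: TermStats is a hypothetical stand-in for the stats that TermContext holds, a plain map stands in for the response structure, and the "docFreq" / "totalTermFreq" keys simply follow the wording of the proposal. The nested layout is one possible shape, not a decided format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: none of these types are Solr or Lucene classes.
class TermsTtfSketch {

    // Hypothetical stand-in for the per-term stats held by TermContext.
    static final class TermStats {
        final int docFreq;
        final long totalTermFreq;

        TermStats(int docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // includeTtf=false keeps the existing "term -> docFreq" output;
    // includeTtf=true emits both stats per term.
    static Map<String, Object> buildTermsMap(String[] terms, TermStats[] stats, boolean includeTtf) {
        Map<String, Object> termsMap = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            if (stats[i] == null) {
                continue; // term not found in the index
            }
            if (includeTtf) {
                Map<String, Object> entry = new LinkedHashMap<>();
                entry.put("docFreq", stats[i].docFreq);
                entry.put("totalTermFreq", stats[i].totalTermFreq);
                termsMap.put(terms[i], entry);
            } else {
                termsMap.put(terms[i], stats[i].docFreq);
            }
        }
        return termsMap;
    }

    public static void main(String[] args) {
        String[] terms = {"disguis", "missing"};
        TermStats[] stats = {new TermStats(2, 5L), null};
        System.out.println(buildTermsMap(terms, stats, false));
        System.out.println(buildTermsMap(terms, stats, true));
    }
}
```

Clients that never send the flag keep parsing a plain number per term, which is the compatibility property asked about above.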
Re: Getting totalTermFreq and docFreq for terms
Looks like this could be a very easy addition to TermsComponent? From what I read in the code, it uses TermContext to compute/hold the stats, and the latter already has docFreq and totalTermFreq (!!). It's just that TermsComponent does not output TTF (only computes it...):

for(int i=0; i<terms.length; i++) {
  if(termContexts[i] != null) {
    String outTerm = fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
    int docFreq = termContexts[i].docFreq();
    termsMap.add(outTerm, docFreq);
  }
}

On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein wrote:

> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
Re: Getting totalTermFreq and docFreq for terms
Yeah, I think expanding the functionality of the terms component looks like the right place to add these stats.

I plan on exposing these types of terms stats as Streaming Expression functions, but I would likely use the terms component under the covers.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera wrote:

> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why doesn't Terms component return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
Re: Getting totalTermFreq and docFreq for terms
No, they are not global distributed stats. I am willing to live with approximated stats though (unless, again, there's an API which can give me both). I wonder why the Terms component doesn't return ttf in addition to docfreq. The API (at the Lucene level) is right there already.

On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein wrote:

> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
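As a side note on what "global" would mean here: if per-shard values were collected separately, summing them gives collection-wide numbers as long as every document lives in exactly one shard. The sketch below shows only that arithmetic. The Stats class and the way the per-shard values are obtained are assumptions for illustration; this is not a description of how Solr aggregates anything.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: client-side summing of per-shard term stats.
class ShardStatsSum {

    // Hypothetical holder for the two stats of one term on one shard.
    static final class Stats {
        final long docFreq;
        final long totalTermFreq;

        Stats(long docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Both stats are additive when shards hold disjoint sets of documents.
    static Stats sum(List<Stats> perShard) {
        long docFreq = 0;
        long totalTermFreq = 0;
        for (Stats s : perShard) {
            docFreq += s.docFreq;
            totalTermFreq += s.totalTermFreq;
        }
        return new Stats(docFreq, totalTermFreq);
    }

    public static void main(String[] args) {
        Stats global = sum(Arrays.asList(new Stats(3, 7), new Stats(1, 1), new Stats(0, 0)));
        System.out.println("docFreq=" + global.docFreq + " totalTermFreq=" + global.totalTermFreq);
    }
}
```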
Re: Getting totalTermFreq and docFreq for terms
Hi Shai,

Do ttf and docfreq return global stats in distributed mode? I wasn't aware that there was a mechanism for aggregating values in the field list.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera wrote:

> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is submit requests as follows:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> [...]
Getting totalTermFreq and docFreq for terms
Hi

I am currently using function queries to obtain these two statistics, as I didn't see a better or more explicit API and the Terms component only returns docFreq, but not totalTermFreq.

The way I use the API is to submit requests as follows:

curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"

Today I noticed that it sometimes returns 0 for these stats for existing terms. After debugging and going through the code, I noticed that it performs analysis on the value that's given. So if I provide an already stemmed value, it analyzes the value further, and in some cases this results in a non-existing term (and in other cases I get stats for a term I didn't ask for).

I want to get the stats of the indexed version of the terms, and that's why I send the already stemmed one. In my case I tried to get the stats for the term 'disguis', which is the stem of 'disguise' and 'disguised'; however, it further analyzed the value to 'disgui' (per the analysis chain), and that term does not exist in the index.

So the first question is: is this the right API to retrieve such statistics? I didn't find another one, but it could be I missed it.

If it is, why does it analyze the value? I tried to wrap the value in single and double quotes, but of course that does not affect the analysis ... is analysis an intended behavior or a bug?

Shai
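The effect described in this message can be reproduced in miniature. The sketch below is a toy model, not the real analysis chain: the three suffix rules in stem() are invented so that 'disguise' and 'disguised' both become 'disguis' and 'disguis' becomes 'disgui', matching the example above, and the "index" is just a map from indexed term to document count. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of analyzing an already-stemmed term a second time.
class DoubleAnalysisDemo {

    // Invented suffix rules; NOT the stemmer from the real analysis chain.
    static String stem(String token) {
        if (token.endsWith("ed")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.endsWith("e") || token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Index time: each word counts as one document containing its stem.
    static Map<String, Integer> buildIndex(String... words) {
        Map<String, Integer> index = new HashMap<>();
        for (String word : words) {
            index.merge(stem(word), 1, Integer::sum);
        }
        return index;
    }

    // Lookup that analyzes its argument first, as the function query was observed to do.
    static int docFreqAnalyzed(Map<String, Integer> index, String value) {
        return index.getOrDefault(stem(value), 0);
    }

    // Lookup that treats its argument as the indexed term, which is what is wanted here.
    static int docFreqRaw(Map<String, Integer> index, String indexedTerm) {
        return index.getOrDefault(indexedTerm, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex("disguise", "disguised");
        System.out.println(docFreqAnalyzed(index, "disguis")); // stems again to 'disgui': 0
        System.out.println(docFreqRaw(index, "disguis"));      // exact indexed term: 2
    }
}
```

In this toy model the analyzed lookup only succeeds when it is given the unstemmed word, which is why passing the indexed form to an analyzing API returns 0.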