Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Joel Bernstein
The idea of adding a terms.ttf parameter sounds fine to me. And it would be
good to get terms.list better integrated into the TermsComponent.  In
general I think it's time for more attention to be paid to the
TermsComponent.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 4:12 PM, Shai Erera  wrote:

> Hmm .. so if I want to add totalTermFreq to the response, it will break
> the current output format of TermsComponent, which returns for each term
> only the docFreq. What's our BWC policy for such API and is there a way to
> handle it?
>
> I can add a new terms.ttf parameter, and so if you set it to true, the
> response will look different (each term will have both docFreq and
> totalTermFreq elements), but if you didn't, you will get the same response.
> Is that acceptable?
>
> Somewhat related, but can be handled separately, I noticed that if you
> specify terms.list and multiple terms.fl parameters, you only receive stats
> for the first field (the rest are ignored), but if you don't specify
> terms.list, you get results for all fields. I don't see any reason not to
> support multiple fields with terms list, what do you think?
>
> On Wed, Feb 22, 2017 at 10:08 PM Shai Erera  wrote:
>
>> Looks like this could be a very easy addition to TermsComponent? From
>> what I read in the code, it uses TermContext to compute/hold the stats, and
>> the latter already has docFreq and totalTermFreq (!!). It's just that
>> TermsComponent does not output TTF (only computes it...):
>>
>> for(int i=0; i<terms.length; i++) {
>>   if(termContexts[i] != null) {
>>     String outTerm = fieldType.indexedToReadable(
>>         terms[i].bytes().utf8ToString());
>>     int docFreq = termContexts[i].docFreq();
>>     termsMap.add(outTerm, docFreq);
>>   }
>> }
>>
>>
>> On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein 
>> wrote:
>>
>> Yeah, I think expanding the functionality of the terms component looks
>> like the right place to add these stats.
>>
>> I plan on exposing these types of terms stats as Streaming Expression
>> functions but I would likely use the terms component under the covers.
>>
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera  wrote:
>>
>> No, they are not global distributed stats. I am willing to live with
>> approximated stats though (unless again, there's an API which can give me
>> both). I wonder why the Terms component doesn't return ttf in addition to
>> docfreq. The API (at the Lucene level) is right there already.
>>
>> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein 
>> wrote:
>>
>> Hi Shai,
>>
>> Do ttf and docfreq return global stats in distributed mode? I wasn't
>> aware that there was a mechanism for aggregating values in the field list.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:
>>
>> Hi
>>
>> I am currently using function queries to obtain these two statistics, as
>> I didn't see a better or more explicit API and the Terms component only
>> returns docFreq, but not totalTermFreq.
>>
>> The way I use the API is to submit requests as follows:
>>
>> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>>
>> Today I noticed that it sometimes returns 0 for these stats for existing
>> terms. After debugging and going through the code, I noticed that it
>> performs analysis on the value that's given. So if I provide an already
>> stemmed value, it analyzes the value further and in some cases it results
>> in a non-existing term (and in other cases I get stats for a term I didn't
>> ask for).
>>
>> I want to get the stats of the indexed version of the terms, and that's
>> why I send the already stemmed one. In my case I tried to get the stats for
>> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
>> it further analyzed the value to 'disgui' (per the analysis chain) and that
>> term does not exist in the index.
>>
>> So first question is -- is this the right API to retrieve such
>> statistics? I didn't find another one, but could be I missed it.
>>
>> If it is, why does it analyze the value? I tried to wrap the value with
>> single and double quotes, but of course that does not affect the analysis
>> ... is analysis an intended behavior or a bug?
>>
>> Shai
>>
>>
>>
>>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Hmm .. so if I want to add totalTermFreq to the response, it will break the
current output format of TermsComponent, which returns for each term only
the docFreq. What's our BWC policy for such API and is there a way to
handle it?

I can add a new terms.ttf parameter, and so if you set it to true, the
response will look different (each term will have both docFreq and
totalTermFreq elements), but if you didn't, you will get the same response.
Is that acceptable?
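
A minimal sketch of the opt-in idea, in Python rather than the component's Java: the per-term value stays a bare docFreq unless the caller asks for TTF. The function name `format_terms`, the element names and the sample numbers are assumptions for illustration only, not what TermsComponent emits.

```python
def format_terms(stats, include_ttf=False):
    """Shape per-term stats; stats maps term -> (doc_freq, total_term_freq)."""
    out = {}
    for term, (doc_freq, ttf) in stats.items():
        if include_ttf:
            # Opt-in shape: both statistics as named elements (names assumed).
            out[term] = {"docFreq": doc_freq, "totalTermFreq": ttf}
        else:
            # Default shape: unchanged, a bare docFreq per term.
            out[term] = doc_freq
    return out


# Made-up numbers, only to show the two shapes side by side.
stats = {"disguis": (2, 3)}
print(format_terms(stats))
print(format_terms(stats, include_ttf=True))
```

Clients that never send the flag keep parsing the same structure, which is the whole point of gating the new shape behind a parameter.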

Somewhat related, but can be handled separately, I noticed that if you
specify terms.list and multiple terms.fl parameters, you only receive stats
for the first field (the rest are ignored), but if you don't specify
terms.list, you get results for all fields. I don't see any reason not to
support multiple fields with terms list, what do you think?
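
For reference, a sketch of the kind of request I have in mind, built in Python. It assumes a `/terms` request handler is configured for the collection, that terms.list takes a comma-separated list, and that the proposed terms.ttf flag exists; the helper name `terms_url` and the field names are made up.

```python
from urllib.parse import urlencode


def terms_url(base, fields, terms, ttf=False):
    """Build a TermsComponent request with one terms.fl per field."""
    params = [("terms", "true")]
    # Repeating terms.fl is how several fields are requested at once.
    params += [("terms.fl", field) for field in fields]
    params.append(("terms.list", ",".join(terms)))
    if ttf:
        # Proposed parameter from this thread, not an established one.
        params.append(("terms.ttf", "true"))
    return f"{base}/terms?{urlencode(params)}"


url = terms_url(
    "http://localhost:8983/solr/mycollection",
    fields=["title", "text"],
    terms=["disguis", "t1"],
    ttf=True,
)
print(url)
```

With the current behavior, a request like this would return stats for the first terms.fl only.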

On Wed, Feb 22, 2017 at 10:08 PM Shai Erera  wrote:

> Looks like this could be a very easy addition to TermsComponent? From what
> I read in the code, it uses TermContext to compute/hold the stats, and the
> latter already has docFreq and totalTermFreq (!!). It's just that
> TermsComponent does not output TTF (only computes it...):
>
> for(int i=0; i<terms.length; i++) {
>   if(termContexts[i] != null) {
>     String outTerm =
>         fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
>     int docFreq = termContexts[i].docFreq();
>     termsMap.add(outTerm, docFreq);
>   }
> }
>
>
> On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein  wrote:
>
> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera  wrote:
>
> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein  wrote:
>
> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is to submit requests as follows:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>
>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Looks like this could be a very easy addition to TermsComponent? From what
I read in the code, it uses TermContext to compute/hold the stats, and the
latter already has docFreq and totalTermFreq (!!). It's just that
TermsComponent does not output TTF (only computes it...):

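for(int i=0; i<terms.length; i++) {
  if(termContexts[i] != null) {
    String outTerm =
        fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
    int docFreq = termContexts[i].docFreq();
    termsMap.add(outTerm, docFreq);
  }
}

On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein wrote: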

> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera  wrote:
>
> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein  wrote:
>
> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is to submit requests as follows:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>
>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Joel Bernstein
Yeah, I think expanding the functionality of the terms component looks like
the right place to add these stats.

I plan on exposing these types of terms stats as Streaming Expression
functions but I would likely use the terms component under the covers.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera  wrote:

> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein  wrote:
>
>> Hi Shai,
>>
>> Do ttf and docfreq return global stats in distributed mode? I wasn't
>> aware that there was a mechanism for aggregating values in the field list.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:
>>
>> Hi
>>
>> I am currently using function queries to obtain these two statistics, as
>> I didn't see a better or more explicit API and the Terms component only
>> returns docFreq, but not totalTermFreq.
>>
>> The way I use the API is to submit requests as follows:
>>
>> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>>
>> Today I noticed that it sometimes returns 0 for these stats for existing
>> terms. After debugging and going through the code, I noticed that it
>> performs analysis on the value that's given. So if I provide an already
>> stemmed value, it analyzes the value further and in some cases it results
>> in a non-existing term (and in other cases I get stats for a term I didn't
>> ask for).
>>
>> I want to get the stats of the indexed version of the terms, and that's
>> why I send the already stemmed one. In my case I tried to get the stats for
>> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
>> it further analyzed the value to 'disgui' (per the analysis chain) and that
>> term does not exist in the index.
>>
>> So first question is -- is this the right API to retrieve such
>> statistics? I didn't find another one, but could be I missed it.
>>
>> If it is, why does it analyze the value? I tried to wrap the value with
>> single and double quotes, but of course that does not affect the analysis
>> ... is analysis an intended behavior or a bug?
>>
>> Shai
>>
>>
>>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
No, they are not global distributed stats. I am willing to live with
approximated stats though (unless again, there's an API which can give me
both). I wonder why the Terms component doesn't return ttf in addition to
docfreq. The API (at the Lucene level) is right there already.

On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein  wrote:

> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is to submit requests as follows:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Joel Bernstein
Hi Shai,

Do ttf and docfreq return global stats in distributed mode? I wasn't aware
that there was a mechanism for aggregating values in the field list.


Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera  wrote:

> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is to submit requests as follows:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>


Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Hi

I am currently using function queries to obtain these two statistics, as I
didn't see a better or more explicit API and the Terms component only
returns docFreq, but not totalTermFreq.

The way I use the API is to submit requests as follows:

curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
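
The same request can be built programmatically. This is a small Python sketch, assuming the three parameters are q, rows and fl (the function queries go in the field list); the helper name `stats_url` is made up for illustration.

```python
from urllib.parse import urlencode


def stats_url(base, field, term):
    """Build a /select URL asking for ttf() and docfreq() of one term via fl."""
    fl = f"ttf({field},'{term}'),docfreq({field},'{term}')"
    params = [("q", "*:*"), ("rows", "1"), ("fl", fl)]
    # urlencode percent-encodes characters such as '*', ':', '(' and "'",
    # so the function-query syntax survives the trip through the URL.
    return f"{base}/select?{urlencode(params)}"


url = stats_url("http://localhost:8983/solr/mycollection", "text", "t1")
print(url)
```

Note that encoding only protects the syntax in transit; it has no effect on whether the term value is analyzed on the server side.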

Today I noticed that it sometimes returns 0 for these stats for existing
terms. After debugging and going through the code, I noticed that it
performs analysis on the value that's given. So if I provide an already
stemmed value, it analyzes the value further and in some cases it results
in a non-existing term (and in other cases I get stats for a term I didn't
ask for).

I want to get the stats of the indexed version of the terms, and that's why
I send the already stemmed one. In my case I tried to get the stats for the
term 'disguis' which is the stem of 'disguise' and 'disguised', however it
further analyzed the value to 'disgui' (per the analysis chain) and that
term does not exist in the index.
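
To make the failure mode concrete, here is a toy Python illustration. The suffix stripper below is a stand-in I made up; it is not the stemmer in my analysis chain, and the index contents are invented. It only shows that stemming is not idempotent, so analyzing an already-stemmed term can produce a different term.

```python
def toy_stem(token):
    """Strip at most one suffix; a toy stand-in for a stemming filter."""
    for suffix in ("ed", "e", "s"):
        # Keep at least three characters of the token after stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Invented index: only the stemmed form is stored (numbers are made up).
index = {"disguis": {"docFreq": 2, "ttf": 3}}

indexed = toy_stem("disguised")   # the form that ends up in the index
reanalyzed = toy_stem(indexed)    # what a second analysis pass turns it into

print(indexed, index.get(indexed))
print(reanalyzed, index.get(reanalyzed))
```

The second lookup misses because the re-analyzed form was never indexed, which matches the zero stats I am seeing.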

So first question is -- is this the right API to retrieve such statistics?
I didn't find another one, but could be I missed it.

If it is, why does it analyze the value? I tried to wrap the value with
single and double quotes, but of course that does not affect the analysis
... is analysis an intended behavior or a bug?

Shai