Re: Optimize facets when actually single valued?

2012-12-18 Thread Ryan McKinley
Is there a JIRA ticket for this?

+1 to Robert's observation that this is independent of any format discussion.



On Wed, Nov 14, 2012 at 5:46 AM, Robert Muir rcm...@gmail.com wrote:

 On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk
 wrote:
  On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
  The original version of Solr (SOLAR when it was still inside CNET) did
  this - a multiValued field with a single value was output as a single
  value, not an array containing a single value.  Some people wanted
  more predictability (always an array or never an array).
 
  So there are two very different issues with this optimization:
 
  Under the hood, it looks like a win. The single value field cache is
  better performing (speed as well as memory) than the uninverted field.
  There's some trickery with index updates as re-use of structures gets
  interesting when all segments have been delivering single value and a
  multi-value segment is introduced.

 this isn't tricky. in solr these structures are top-level (on top of
 SlowMultiReaderWrapper).

 
  Dynamically changing response formats sounds horrible.

 I don't understand how this is related with my proposal to
 automatically use a different data structure behind the scenes.

 The optimization I am talking about is safe and simple and no user
 would have any idea.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Optimize facets when actually single valued?

2012-12-18 Thread Robert Muir
On Tue, Dec 18, 2012 at 8:06 PM, Ryan McKinley ryan...@gmail.com wrote:
 Is there a JIRA ticket for this?

 +1 to Robert's observation that this is independent of any format discussion.


I don't know of one, but feel free!

I thought of the stats situation at some point:
terms.size == terms.sumDocFreq should be enough, I think, for faceting purposes?
It doesn't really mean the field is truly single valued, because a term
could exist twice for the same doc, but for faceting etc. we don't care
about that, I think?
If we really want to check that no term has tf > 1 within a doc, we'd
have to involve sumTotalTermFreq too, which is irrelevant here and
unavailable if frequencies are omitted.
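
A toy illustration of the distinction above, not Lucene code. It assumes a field is modelled as a dict from doc id to the list of indexed terms, and freq_stats is a hypothetical helper. It shows that a term repeated inside one document leaves the sumDocFreq-style count unchanged and only shows up in the sumTotalTermFreq-style count.

```python
from collections import Counter


def freq_stats(docs):
    """Return (sum_doc_freq, sum_total_term_freq) for one field.

    docs maps a doc id to the list of terms indexed for the field.
    Both names mirror the Lucene statistics only loosely; this is a model.
    """
    sum_doc_freq = 0          # number of (term, doc) pairs
    sum_total_term_freq = 0   # number of term occurrences
    for terms in docs.values():
        counts = Counter(terms)
        sum_doc_freq += len(counts)
        sum_total_term_freq += sum(counts.values())
    return sum_doc_freq, sum_total_term_freq


# Doc 1 holds the same term twice; doc 2 holds one term.
repeated = {1: ["solr", "solr"], 2: ["lucene"]}
sum_doc_freq, sum_total_term_freq = freq_stats(repeated)

# For faceting the field still behaves as single valued ...
single_valued_for_faceting = sum_doc_freq == len(repeated)
# ... while the stricter check notices the repeated term.
no_repeats_within_doc = sum_total_term_freq == sum_doc_freq
```

With this data the first flag is true and the second is false, which is the case described above: the repeat does not matter for faceting, and detecting it needs the occurrence count.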




Re: Optimize facets when actually single valued?

2012-11-16 Thread Toke Eskildsen
On Wed, 2012-11-14 at 14:46 +0100, Robert Muir wrote:
 On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk 
 wrote:
  Dynamically changing response formats sounds horrible.
 
 I don't understand how this is related with my proposal to
 automatically use a different data structure behind the scenes.

I replied to Yonik Seeley, who pointed out that the output format
historically had displayed this behavior. It is related because
automatic switching between single/multi-value in the inner workings
might also result in a mirrored switching of output formats. I know that
you have made no such claims - it is just general discussion of
different aspects of the single/multi-value issue.





Re: Optimize facets when actually single valued?

2012-11-16 Thread Yonik Seeley
On Wed, Nov 14, 2012 at 8:41 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Dynamically changing response formats sounds horrible.

It depends if you consider it a change of format.  A single value
would always be presented as a single value, while multiple values
would always be represented as an array.  It's on a per-document
basis, and is not determined by whether the field as a whole is
multiValued.

To users of JSON, I think it's pretty natural:
[
  { "id":"doc1", "author" : "David" },
  { "id":"doc2", "author" : ["Mike","Erik"] }
]

One could think of it in reverse too (that the current way of doing
things is actually more prone to changing formats just because you
changed a type).
Say you indexed an author field as multiValued=false, but then
realized you needed to sometimes add multiple values... now everything
that had been coming back as "author":"David" starts coming back as
"author":["David"]

Ryan wrote:
 If the only motivation for adding 'multiValued=flexible' is the response 
 format, what about just changing the response format version number

That's a good point.   It doesn't seem particularly valuable to
enable/disable this on a per-field basis, and one could see wanting to
concurrently support different clients that want their results
different ways.  That really argues for a request parameter (or
version) to control how multiValued fields are handled.
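
A minimal sketch of what such a request-level switch could look like, assuming a hypothetical render_doc function and a hypothetical boolean flag named flexible. Nothing here is Solr's response writer; it only shows the per-document rule described above.

```python
def render_doc(doc, multivalued_fields, flexible):
    """Render one document for output.

    doc maps field names to stored values; fields listed in
    multivalued_fields hold a list. When flexible is true, a
    multiValued field with exactly one value is written as a scalar.
    """
    out = {}
    for name, value in doc.items():
        if flexible and name in multivalued_fields and len(value) == 1:
            out[name] = value[0]
        else:
            out[name] = value
    return out


doc1 = {"id": "doc1", "author": ["David"]}
doc2 = {"id": "doc2", "author": ["Mike", "Erik"]}

flexible_view = [render_doc(d, {"author"}, flexible=True) for d in (doc1, doc2)]
strict_view = [render_doc(d, {"author"}, flexible=False) for d in (doc1, doc2)]
```

Because the flag travels with the request rather than the schema, two clients could ask for the two views of the same index at the same time.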

-Yonik
http://lucidworks.com




Re: Optimize facets when actually single valued?

2012-11-14 Thread Robert Muir
On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
 The original version of Solr (SOLAR when it was still inside CNET) did
 this - a multiValued field with a single value was output as a single
 value, not an array containing a single value.  Some people wanted
 more predictability (always an array or never an array).

 So there are two very different issues with this optimization:

 Under the hood, it looks like a win. The single value field cache is
 better performing (speed as well as memory) than the uninverted field.
 There's some trickery with index updates as re-use of structures gets
 interesting when all segments have been delivering single value and a
 multi-value segment is introduced.

this isn't tricky. in solr these structures are top-level (on top of
SlowMultiReaderWrapper).
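
A rough sketch of why a top-level decision sidesteps the per-segment bookkeeping, assuming each segment reports a (doc_count, sum_doc_freq) pair for the field and ignoring deleted documents. The function names and the two structure labels are hypothetical; this is not how Solr is implemented.

```python
def top_level_single_valued(segments):
    """segments is a list of (doc_count, sum_doc_freq) pairs, one per segment.

    Each document lives in exactly one segment, so both statistics
    can simply be summed to get the top-level view.
    """
    doc_count = sum(dc for dc, _ in segments)
    sum_doc_freq = sum(sdf for _, sdf in segments)
    return doc_count == sum_doc_freq


def choose_structure(segments):
    """Pick the structure to build for the whole reader, not per segment."""
    if top_level_single_valued(segments):
        return "field_cache"
    return "uninverted_field"


before = [(10, 10), (5, 5)]        # every segment single valued so far
after = before + [(3, 4)]          # a multi-valued segment arrives
```

The choice is simply recomputed for the new top-level view, so nothing built for the earlier segments has to be patched in place.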


 Dynamically changing response formats sounds horrible.

I don't understand how this is related with my proposal to
automatically use a different data structure behind the scenes.

The optimization I am talking about is safe and simple and no user
would have any idea.




Re: Optimize facets when actually single valued?

2012-11-14 Thread Ryan McKinley


 The optimization I am talking about is safe and simple and no user
 would have any idea.


+1

the end format should be a different issue -- under the hood, multivalued
fields should perform well if they are actually single valued.


Re: Optimize facets when actually single valued?

2012-11-13 Thread Ryan McKinley
If the only motivation for adding 'multiValued=flexible' is the response
format, what about just changing the response format version number and
writing the wrapping list based on that?

Allowing multiple values, but behaving like single value fields when only
one value exists would be a *huge* simplification for my app!

ryan




On Sun, Nov 11, 2012 at 7:09 AM, Yonik Seeley yo...@lucidworks.com wrote:

 On Sun, Nov 11, 2012 at 3:33 AM, Robert Muir rcm...@gmail.com wrote:
  I am guessing at times people are lazy about schema definition. But, I think
  with lucene 4 stats we can detect if a field is actually single valued...
  Something like terms.size == terms.doccount == terms.sumdocfreq. I have to
  think about it a bit, maybe it's even simpler than this? Anyway, this could
  be used instead of actual schema def to just build a fieldcache instead of
  uninverted field I think... Should be a simple opto but maybe potent...

 Funny you should mention this now - I was thinking exactly the same
 thing on the flight home from ApacheCon!

 This "detect single-valued" also has implications for things other
 than faceting as well - as you say, people can be lazy about the
 schema definition and having things just work is a good thing.

 I've thought about a more flexible field that acts like a single
 valued field when you use it like that, and a multi-valued field
 otherwise.  There won't quite be back compat with responses though
 (since multiValued fields with single values now look like
 foo:[single_value] instead of foo:single_value.)  Perhaps we
 could add something like multiValued=flexible or something (and switch
 to that by default), while retaining back compat for
 multiValued=true/false.  Either that or bump version of the schema
 or response.  This is actually pretty important if we ever want to do
 more schema-less (i.e. type guessing based on input), since it
 allows us to only guess type and not have to deal with figuring out
 multiValued.  It could lower the number of dynamic field definitions
 necessary and make choosing the correct one simpler.

 -Yonik
 http://lucidworks.com





Re: Optimize facets when actually single valued?

2012-11-13 Thread Yonik Seeley
On Tue, Nov 13, 2012 at 6:37 PM, Ryan McKinley ryan...@gmail.com wrote:
 If the only motivation for adding 'multiValued=flexible' is the response
 format, what about just changing the response format version number and
 writing the wrapping list based on that?

The original version of Solr (SOLAR when it was still inside CNET) did
this - a multiValued field with a single value was output as a single
value, not an array containing a single value.  Some people wanted
more predictability (always an array or never an array).

-Yonik
http://lucidworks.com


 Allowing multiple values, but behaving like single value fields when only
 one value exists would be a *huge* simplification for my app!

 ryan




Re: Optimize facets when actually single valued?

2012-11-13 Thread Toke Eskildsen
On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
 The original version of Solr (SOLAR when it was still inside CNET) did
 this - a multiValued field with a single value was output as a single
 value, not an array containing a single value.  Some people wanted
 more predictability (always an array or never an array).

So there are two very different issues with this optimization:

Under the hood, it looks like a win. The single value field cache is
better performing (speed as well as memory) than the uninverted field.
There's some trickery with index updates as re-use of structures gets
interesting when all segments have been delivering single value and a
multi-value segment is introduced.

Dynamically changing response formats sounds horrible. The premise for
this optimization is laziness (or lack of oversight) from some users. If
the searcher normally returns one format, those users will design their
frontend from an expectation that it will _always_ return that format.

Always returning arrays, even when the underlying system has dynamically
selected single value mode and only a single value is returned, forces
the frontend programmers to consider both cases.
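
For the frontend side of this concern, a minimal sketch of normalizing a field that may arrive either as a scalar or as an array. as_list is a hypothetical helper, not part of any Solr client library.

```python
def as_list(value):
    """Return the field value as a list, whatever shape it arrived in."""
    if value is None:
        return []                 # field missing from the document
    if isinstance(value, (list, tuple)):
        return list(value)        # already multi-valued
    return [value]                # single value, wrap it


docs = [
    {"id": "doc1", "author": "David"},
    {"id": "doc2", "author": ["Mike", "Erik"]},
    {"id": "doc3"},
]
authors = [as_list(d.get("author")) for d in docs]
```

A frontend that routes every read through one such helper no longer depends on which format the searcher picked for a given document.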





Re: Optimize facets when actually single valued?

2012-11-12 Thread Jimmy Sélamy
Hi,

The Solr version is 3.6.1.

Here's my query. You may find it a bit huge, but I absolutely need all of
this in my response.



q=*:*

fq=language_code:(fr_CA) AND acl_name:(cch_CP_AP_Archives OR
cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR
cch_official_feed_content OR cch_official_press_release_acl OR
cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR
cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR
cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR RW)

fq=((type:(cch_published_story OR cch_story) AND
language_code:(fr_CA) AND acl_name:(cch_CP_AP_Archives OR
cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR
cch_official_feed_content OR cch_official_press_release_acl OR
cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR
cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR
cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR RW))
OR (type:(cch_photo) AND mfile_url:([* TO *]) AND
acl_name:(cch_CP_AP_Archives OR cch_archive_content OR
cch_browse_official_feed_folder OR cch_folder_acl OR
cch_official_feed_content OR cch_official_press_release_acl OR
cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR
cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR
cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR
RW)))

rows=0
start=0

facet.sort=count
facet.field=source_id
*facet.field=facet_tme_person_name_french*
*facet.field=facet_tme_geographic_location_french*
*facet.field=facet_tme_iptc_category*
*facet.field=facet_tme_organization_name_french*
facet.field=feed_type

f.source_id.facet.limit=-1
f.source_id.facet.mincount=1
f.facet_tme_person_name_french.facet.limit=25
f.facet_tme_person_name_french.facet.mincount=1
f.facet_tme_geographic_location_french.facet.limit=25
f.facet_tme_geographic_location_french.facet.mincount=1
f.facet_tme_iptc_category.facet.limit=25
f.facet_tme_iptc_category.facet.mincount=1
f.facet_tme_organization_name_french.facet.limit=25
f.facet_tme_organization_name_french.facet.mincount=1
f.feed_type.facet.limit=25
f.feed_type.facet.mincount=1
facet.range=r_creation_date1
facet.range=r_creation_date2
facet.range=r_creation_date3
facet.range=r_creation_date4
f.r_creation_date1.facet.range.start=NOW-1HOUR
f.r_creation_date1.facet.range.end=NOW
f.r_creation_date1.facet.range.gap=+1HOUR
f.r_creation_date2.facet.range.start=NOW-24HOUR
f.r_creation_date2.facet.range.end=NOW
f.r_creation_date2.facet.range.gap=+24HOUR
f.r_creation_date3.facet.range.start=NOW-48HOUR
f.r_creation_date3.facet.range.end=NOW
f.r_creation_date3.facet.range.gap=+48HOUR
f.r_creation_date4.facet.range.start=NOW-7DAY
f.r_creation_date4.facet.range.end=NOW
f.r_creation_date4.facet.range.gap=+7DAY

facet=true
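
A small sketch of assembling a parameter list like the one above into a query string, using only the Python standard library. It uses a shortened subset of the parameters, and the list-of-pairs layout is an assumption chosen so that repeated keys such as facet.field survive.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs (not a dict) keeps repeated parameter names.
params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("start", "0"),
    ("facet", "true"),
    ("facet.sort", "count"),
    ("facet.field", "source_id"),
    ("facet.field", "feed_type"),
    ("f.source_id.facet.limit", "-1"),
    ("f.source_id.facet.mincount", "1"),
    ("f.feed_type.facet.limit", "25"),
    ("f.feed_type.facet.mincount", "1"),
]

# urlencode joins the pairs with "&" and percent-encodes the values.
query_string = urlencode(params)

# Round trip to check that nothing was lost or merged.
decoded = parse_qs(query_string)
```

Building the string this way keeps every parameter separated and escaped, so a long fq value cannot run into the parameter that follows it.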

=

The fields in bold are the ones I'm having performance issues with.

I've set facet.method=enum. This improved performance, but it is still not
acceptable for my application. Here are the timings I logged using the same
fq, with each facet field run by itself. Note that only the fields whose
names start with "facet" are my multivalued fields.


o Date range facet (681,25 ms)

o Feed type (586,5 ms)

o Categories (898 ms)

o facet_tme_geographic_location_french (1249 ms)

o facet_tme_person_name_french (1940,75 ms)

o facet_tme_organization_name_french (1240,75 ms)

Combined, they take about 6000 ms.

As for the other questions you asked, such as "How many unique values are
there in the field?", I don't know how to get this information.

*Jimmy M. Sélamy*


2012/11/11 Erick Erickson erickerick...@gmail.com

 You have to provide more details. How many unique values are there in the
 field in question? What's the query you're using? Are you sure other parts
 of the query aren't the culprit? What Solr version are you using?

 Please review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best
 Erick


 On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:

 I'm having performance issues with faceting on a multivalued field in an
 index of over 20 million documents.

 When faceting on a multivalued field, the QTime is unacceptable for my
 application because it can take up to 6000 ms.

 I've set facet.method to enum, which improved my performance to the time
 I just mentioned. It's still not acceptable.

 Are there any suggestions?

 Sent from BlackBerry on the Vidéotron mobile network
 --
 *From: * Robert Muir rcm...@gmail.com
 *Date: *Sat, 10 Nov 2012 21:33:47 -0500
 *To: *dev@lucene.apache.org
 *ReplyTo: * dev@lucene.apache.org
 *Subject: *Optimize facets when actually single valued?

 I am guessing at times people are lazy about schema definition. But, I
 think with lucene 4 stats we can detect if a field is actually single
 valued... Something like terms.size == terms.doccount == terms.sumdocfreq.
 I

Re: Optimize facets when actually single valued?

2012-11-11 Thread Erick Erickson
You have to provide more details. How many unique values are there in the
field in question? What's the query you're using? Are you sure other parts
of the query aren't the culprit? What Solr version are you using?

Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:

 I'm having performance issues with faceting on a multivalued field in an
 index of over 20 million documents.

 When faceting on a multivalued field, the QTime is unacceptable for my
 application because it can take up to 6000 ms.

 I've set facet.method to enum, which improved my performance to the time
 I just mentioned. It's still not acceptable.

 Are there any suggestions?

 Sent from BlackBerry on the Vidéotron mobile network
 --
 *From: * Robert Muir rcm...@gmail.com
 *Date: *Sat, 10 Nov 2012 21:33:47 -0500
 *To: *dev@lucene.apache.org
 *ReplyTo: * dev@lucene.apache.org
 *Subject: *Optimize facets when actually single valued?

 I am guessing at times people are lazy about schema definition. But, I
 think with lucene 4 stats we can detect if a field is actually single
 valued... Something like terms.size == terms.doccount == terms.sumdocfreq.
 I have to think about it a bit, maybe it's even simpler than this? Anyway,
 this could be used instead of actual schema def to just build a fieldcache
 instead of uninverted field I think... Should be a simple opto but maybe
 potent...



Re: Optimize facets when actually single valued?

2012-11-11 Thread Robert Muir
On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:
 I'm having performance issues with faceting on a multivalued field in an
 index of over 20 million documents.

 When faceting on a multivalued field, the QTime is unacceptable for my
 application because it can take up to 6000 ms.

 I've set facet.method to enum, which improved my performance to the time
 I just mentioned. It's still not acceptable.

 Are there any suggestions?


Yes: don't hijack my mailing list threads.




Re: Optimize facets when actually single valued?

2012-11-11 Thread Yonik Seeley
On Sun, Nov 11, 2012 at 3:33 AM, Robert Muir rcm...@gmail.com wrote:
 I am guessing at times people are lazy about schema definition. But, I think
 with lucene 4 stats we can detect if a field is actually single valued...
 Something like terms.size == terms.doccount == terms.sumdocfreq. I have to
 think about it a bit, maybe it's even simpler than this? Anyway, this could
 be used instead of actual schema def to just build a fieldcache instead of
 uninverted field I think... Should be a simple opto but maybe potent...

Funny you should mention this now - I was thinking exactly the same
thing on the flight home from ApacheCon!

This "detect single-valued" also has implications for things other
than faceting as well - as you say, people can be lazy about the
schema definition and having things just work is a good thing.

I've thought about a more flexible field that acts like a single
valued field when you use it like that, and a multi-valued field
otherwise.  There won't quite be back compat with responses though
(since multiValued fields with single values now look like
foo:[single_value] instead of foo:single_value.)  Perhaps we
could add something like multiValued=flexible or something (and switch
to that by default), while retaining back compat for
multiValued=true/false.  Either that or bump version of the schema
or response.  This is actually pretty important if we ever want to do
more schema-less (i.e. type guessing based on input), since it
allows us to only guess type and not have to deal with figuring out
multiValued.  It could lower the number of dynamic field definitions
necessary and make choosing the correct one simpler.
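
A minimal sketch of the type-guessing idea, assuming raw input values arrive as strings and that a field's input may be one string or a list of them. guess_type and guess_field_type are hypothetical names, and the three type labels are placeholders rather than Solr field types.

```python
def guess_type(raw):
    """Guess a type label for one raw string value."""
    for name, parse in (("int", int), ("float", float)):
        try:
            parse(raw)
            return name
        except ValueError:
            continue
    return "string"


def guess_field_type(values):
    """Guess a type for a field from one value or a list of values.

    How many values there are never enters the decision; only
    their types do.
    """
    if isinstance(values, str):
        values = [values]
    kinds = {guess_type(v) for v in values}
    if not kinds:
        return "string"           # nothing to go on
    if kinds == {"int"}:
        return "int"
    if kinds <= {"int", "float"}:
        return "float"            # mixed numerics widen to float
    return "string"
```

The guesser returns the same kind of answer for "42" and for ["1", "2.5"], which is the simplification described above: the input decides the type, and multiplicity is left to the field.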

-Yonik
http://lucidworks.com




Optimize facets when actually single valued?

2012-11-10 Thread Robert Muir
I am guessing at times people are lazy about schema definition. But, I
think with lucene 4 stats we can detect if a field is actually single
valued... Something like terms.size == terms.doccount == terms.sumdocfreq.
I have to think about it a bit, maybe it's even simpler than this? Anyway,
this could be used instead of actual schema def to just build a fieldcache
instead of uninverted field I think... Should be a simple opto but maybe
potent...
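
A toy sketch of the statistics named above, computed over an in-memory inverted index rather than through any Lucene API. The field model (a dict from doc id to a list of terms) and the function names are assumptions. The check uses only the doc_count == sum_doc_freq part of the comparison, which is one possible reading of "even simpler".

```python
from collections import defaultdict


def field_stats(docs):
    """Return (size, doc_count, sum_doc_freq) for one field.

    docs maps a doc id to the list of terms indexed for the field.
    """
    postings = defaultdict(set)   # term -> ids of docs containing it
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    size = len(postings)                                   # unique terms
    doc_count = len(set().union(*postings.values()))       # docs with a term
    sum_doc_freq = sum(len(ids) for ids in postings.values())
    return size, doc_count, sum_doc_freq


def looks_single_valued(docs):
    """True when every doc that has the field has one distinct term."""
    _, doc_count, sum_doc_freq = field_stats(docs)
    return doc_count == sum_doc_freq


single = {1: ["david"], 2: ["mike"], 3: ["david"]}
multi = {1: ["david"], 2: ["mike", "erik"]}
```

In the first toy field two unique terms are spread over three documents, so a three-way equality that includes size would not hold even though each document carries one value; that is why the sketch leaves size out.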


Re: Optimize facets when actually single valued?

2012-11-10 Thread Jimmy Sélamy
I'm having performance issues with faceting on a multivalued field in an
index of over 20 million documents.

When faceting on a multivalued field, the QTime is unacceptable for my
application because it can take up to 6000 ms.

I've set facet.method to enum, which improved my performance to the time
I just mentioned. It's still not acceptable.

Are there any suggestions?


Sent from BlackBerry on the Vidéotron mobile network

-Original Message-
From: Robert Muir rcm...@gmail.com
Date: Sat, 10 Nov 2012 21:33:47 
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: Optimize facets when actually single valued?

I am guessing at times people are lazy about schema definition. But, I
think with lucene 4 stats we can detect if a field is actually single
valued... Something like terms.size == terms.doccount == terms.sumdocfreq.
I have to think about it a bit, maybe it's even simpler than this? Anyway,
this could be used instead of actual schema def to just build a fieldcache
instead of uninverted field I think... Should be a simple opto but maybe
potent...