Re: Optimize facets when actually single valued?
Is there a JIRA ticket for this?

+1 to Robert's observation that this is independent of any format discussion.

On Wed, Nov 14, 2012 at 5:46 AM, Robert Muir rcm...@gmail.com wrote:
> On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
>> On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
>>> The original version of Solr (SOLAR when it was still inside CNET) did this - a multiValued field with a single value was output as a single value, not an array containing a single value. Some people wanted more predictability (always an array or never an array).
>>
>> So there are two very different issues with this optimization:
>>
>> Under the hood, it looks like a win. The single value field cache is better performing (speed as well as memory) than the uninverted field. There's some trickery with index updates, as re-use of structures gets interesting when all segments have been delivering single values and a multi-value segment is introduced.
>
> This isn't tricky. In Solr these structures are top-level (on top of SlowMultiReaderWrapper).
>
>> Dynamically changing response formats sounds horrible.
>
> I don't understand how this is related to my proposal to automatically use a different data structure behind the scenes. The optimization I am talking about is safe and simple and no user would have any idea.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Optimize facets when actually single valued?
On Tue, Dec 18, 2012 at 8:06 PM, Ryan McKinley ryan...@gmail.com wrote:
> Is there a JIRA ticket for this?
>
> +1 to Robert's observation that this is independent of any format discussion.

I don't know of one, but feel free!

I thought about the stats situation at some point: terms.size == terms.sumDocFreq should be enough, I think, for faceting purposes. It doesn't really mean the field is truly single valued, because a term could occur twice in the same doc, but for faceting etc. we don't care about that, I think.

If we really want to check that no term has tf > 1 within a doc, we'd have to involve sumTotalTermFreq too, which is irrelevant here and unavailable if frequencies are omitted.
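The difference between document frequency and total term frequency mentioned above can be illustrated with a toy model. This is only a sketch in plain Python, not Lucene code: `freq_stats` is a hypothetical helper, and the token lists stand in for the indexed terms of one field.

```python
from collections import Counter

def freq_stats(docs):
    """Return (sumDocFreq, sumTotalTermFreq) for a toy field.

    docs is a list of token lists, one per document (hypothetical input
    format chosen for this sketch).
    """
    doc_freq = Counter()   # term -> number of docs containing the term
    total_tf = Counter()   # term -> number of occurrences across all docs
    for tokens in docs:
        total_tf.update(tokens)        # counts every occurrence
        doc_freq.update(set(tokens))   # counts each term once per doc
    return sum(doc_freq.values()), sum(total_tf.values())

# The first doc repeats the same term; the second has one term.
repeated = [["red", "red"], ["blue"]]
sum_doc_freq, sum_total_term_freq = freq_stats(repeated)
print(sum_doc_freq, sum_total_term_freq)
```

In this model the repeated term leaves the document-frequency sum untouched and only shows up in the total-term-frequency sum, which is why a check based on document frequencies alone cannot see it.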
Re: Optimize facets when actually single valued?
On Wed, 2012-11-14 at 14:46 +0100, Robert Muir wrote:
> On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
>> Dynamically changing response formats sounds horrible.
>
> I don't understand how this is related to my proposal to automatically use a different data structure behind the scenes.

I replied to Yonik Seeley, who pointed out that the output format had historically displayed this behavior. It is related because automatic switching between single/multi-value in the inner workings might also result in a mirrored switching of output formats.

I know that you have made no such claims - it is just general discussion of different aspects of the single/multi-value issue.
Re: Optimize facets when actually single valued?
On Wed, Nov 14, 2012 at 8:41 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
> Dynamically changing response formats sounds horrible.

It depends on whether you consider it a change of format. A single value would always be presented as a single value, while multiple values would always be represented as an array. It's on a per-document basis, and is not determined by whether the field as a whole is multiValued.

To users of JSON, I think it's pretty natural:

  [
    { "id":"doc1", "author" : "David" },
    { "id":"doc2", "author" : ["Mike","Erik"] }
  ]

One could think of it in reverse too (that the current way of doing things is actually more prone to changing formats just because you changed a type). Say you indexed an author field as multiValued=false, but then realized you needed to sometimes add multiple values... now everything that had been coming back as "author":"David" starts coming back as "author":["David"].

Ryan wrote:
> If the only motivation for adding 'multiValued=flexible' is the response format, what about just changing the response format version number

That's a good point. It doesn't seem particularly valuable to enable/disable this on a per-field basis, and one could see wanting to concurrently support different clients that want their results in different ways. That really argues for a request parameter (or version) to control how multiValued fields are handled.

-Yonik
http://lucidworks.com
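The per-document behavior described above, switched by a request-level flag rather than per field, could look roughly like this. It is a sketch only: `render_doc` and the `flexible` flag are hypothetical names for illustration and are not Solr parameters or Solr code.

```python
import json

def render_doc(doc, multi_valued_fields, flexible):
    """Serialize one stored document.

    doc maps field name -> list of stored values (hypothetical storage model).
    With flexible=True a multiValued field holding one value is written as a
    scalar; with flexible=False it is always written as an array.
    """
    out = {}
    for name, values in doc.items():
        is_multi = name in multi_valued_fields
        if is_multi and not (flexible and len(values) == 1):
            out[name] = list(values)
        else:
            out[name] = values[0]
    return out

docs = [
    {"id": ["doc1"], "author": ["David"]},
    {"id": ["doc2"], "author": ["Mike", "Erik"]},
]

flexible_out = [render_doc(d, {"author"}, flexible=True) for d in docs]
strict_out = [render_doc(d, {"author"}, flexible=False) for d in docs]
print(json.dumps(flexible_out))
print(json.dumps(strict_out))
```

Keeping the switch on the request side means two clients with different expectations can query the same index at the same time, which is the argument made above for a request parameter or version.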
Re: Optimize facets when actually single valued?
On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
> On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
>> The original version of Solr (SOLAR when it was still inside CNET) did this - a multiValued field with a single value was output as a single value, not an array containing a single value. Some people wanted more predictability (always an array or never an array).
>
> So there are two very different issues with this optimization:
>
> Under the hood, it looks like a win. The single value field cache is better performing (speed as well as memory) than the uninverted field. There's some trickery with index updates, as re-use of structures gets interesting when all segments have been delivering single values and a multi-value segment is introduced.

This isn't tricky. In Solr these structures are top-level (on top of SlowMultiReaderWrapper).

> Dynamically changing response formats sounds horrible.

I don't understand how this is related to my proposal to automatically use a different data structure behind the scenes. The optimization I am talking about is safe and simple and no user would have any idea.
Re: Optimize facets when actually single valued?
> The optimization I am talking about is safe and simple and no user would have any idea.

+1

The end format should be a different issue -- under the hood, multivalued fields should perform well if they are actually single valued.
Re: Optimize facets when actually single valued?
If the only motivation for adding 'multiValued=flexible' is the response format, what about just changing the response format version number and writing the wrapping list based on that?

Allowing multiple values, but behaving like single value fields when only one value exists, would be a *huge* simplification for my app!

ryan

On Sun, Nov 11, 2012 at 7:09 AM, Yonik Seeley yo...@lucidworks.com wrote:
> On Sun, Nov 11, 2012 at 3:33 AM, Robert Muir rcm...@gmail.com wrote:
>> I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I have to think about it a bit, maybe it's even simpler than this? Anyway, this could be used instead of the actual schema def to just build a fieldcache instead of an uninverted field, I think... Should be a simple opto but maybe potent...
>
> Funny you should mention this now - I was thinking exactly the same thing on the flight home from ApacheCon!
>
> This single-valued detection also has implications for things other than faceting - as you say, people can be lazy about the schema definition, and having things just work is a good thing.
>
> I've thought about a more flexible field that acts like a single valued field when you use it like that, and a multi-valued field otherwise. There won't quite be back compat with responses though (since multiValued fields with single values now look like foo:[single_value] instead of foo:single_value). Perhaps we could add something like multiValued=flexible (and switch to that by default), while retaining back compat for multiValued=true/false. Either that or bump the version of the schema or response.
>
> This is actually pretty important if we ever want to do more schema-less (i.e. type guessing based on input), since it allows us to only guess the type and not have to deal with figuring out multiValued. It could lower the number of dynamic field definitions necessary and make choosing the correct one simpler.
>
> -Yonik
> http://lucidworks.com
Re: Optimize facets when actually single valued?
On Tue, Nov 13, 2012 at 6:37 PM, Ryan McKinley ryan...@gmail.com wrote:
> If the only motivation for adding 'multiValued=flexible' is the response format, what about just changing the response format version number and writing the wrapping list based on that?

The original version of Solr (SOLAR when it was still inside CNET) did this - a multiValued field with a single value was output as a single value, not an array containing a single value. Some people wanted more predictability (always an array or never an array).

-Yonik
http://lucidworks.com

> Allowing multiple values, but behaving like single value fields when only one value exists, would be a *huge* simplification for my app!
>
> ryan
Re: Optimize facets when actually single valued?
On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
> The original version of Solr (SOLAR when it was still inside CNET) did this - a multiValued field with a single value was output as a single value, not an array containing a single value. Some people wanted more predictability (always an array or never an array).

So there are two very different issues with this optimization:

Under the hood, it looks like a win. The single value field cache is better performing (speed as well as memory) than the uninverted field. There's some trickery with index updates, as re-use of structures gets interesting when all segments have been delivering single values and a multi-value segment is introduced.

Dynamically changing response formats sounds horrible. The premise for this optimization is laziness (or lack of oversight) from some users. If the searcher normally returns one format, those users will design their frontend from an expectation that it will _always_ return that format. Always returning arrays, even when the underlying system has dynamically selected single value mode and only a single value is returned, forces the frontend programmers to consider both cases.
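A frontend that cannot rely on one response shape usually ends up normalizing field values itself. A minimal sketch of such a client-side helper, assuming the response has already been parsed from JSON (`as_list` is a hypothetical name, not part of any Solr client library):

```python
def as_list(value):
    """Return a field value as a list, whatever shape the server sent."""
    if value is None:
        return []                 # field missing from the document
    if isinstance(value, (list, tuple)):
        return list(value)        # already an array
    return [value]                # scalar wrapped into a one-element list

doc1 = {"id": "doc1", "author": "David"}
doc2 = {"id": "doc2", "author": ["Mike", "Erik"]}

for doc in (doc1, doc2):
    for author in as_list(doc.get("author")):
        print(doc["id"], author)
```

Every place that reads a possibly multi-valued field has to go through a helper like this, which is the extra burden on frontend code that the message above is concerned with.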
Re: Optimize facets when actually single valued?
Hi,

The version of Solr is 3.6.1.

Here's my query. You may find it a bit huge, but I absolutely need all of this in my response.

  q=*:*
  fq=language_code:(fr_CA) AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR RW)
  fq=((type:(cch_published_story OR cch_story) AND language_code:(fr_CA) AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR RW)) OR (type:(cch_photo) AND mfile_url:([* TO *]) AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:(WF OR OF OR RW)))
  rows=0
  start=0
  facet.sort=count
  facet.field=source_id
  *facet.field=facet_tme_person_name_french*
  *facet.field=facet_tme_geographic_location_french*
  *facet.field=facet_tme_iptc_category*
  *facet.field=facet_tme_organization_name_french*
  facet.field=feed_type
  f.source_id.facet.limit=-1
  f.source_id.facet.mincount=1
  f.facet_tme_person_name_french.facet.limit=25
  f.facet_tme_person_name_french.facet.mincount=1
  f.facet_tme_geographic_location_french.facet.limit=25
  f.facet_tme_geographic_location_french.facet.mincount=1
  f.facet_tme_iptc_category.facet.limit=25
  f.facet_tme_iptc_category.facet.mincount=1
  f.facet_tme_organization_name_french.facet.limit=25
  f.facet_tme_organization_name_french.facet.mincount=1
  f.feed_type.facet.limit=25
  f.feed_type.facet.mincount=1
  facet.range=r_creation_date1
  facet.range=r_creation_date2
  facet.range=r_creation_date3
  facet.range=r_creation_date4
  f.r_creation_date1.facet.range.start=NOW-1HOUR
  f.r_creation_date1.facet.range.end=NOW
  f.r_creation_date1.facet.range.gap=+1HOUR
  f.r_creation_date2.facet.range.start=NOW-24HOUR
  f.r_creation_date2.facet.range.end=NOW
  f.r_creation_date2.facet.range.gap=+24HOUR
  f.r_creation_date3.facet.range.start=NOW-48HOUR
  f.r_creation_date3.facet.range.end=NOW
  f.r_creation_date3.facet.range.gap=+48HOUR
  f.r_creation_date4.facet.range.start=NOW-7DAY
  f.r_creation_date4.facet.range.end=NOW
  f.r_creation_date4.facet.range.gap=+7DAY
  facet=true

The fields in bold are the ones I'm having performance issues with. I've set facet.method=enum; this improves performance, but it is still not acceptable for my application.

Here are the timings I logged with the same fq, but with each facet field on its own. Note that only the fields whose names start with facet_ are multivalued.

o Date range facet (681.25 ms)
o Feed type (586.5 ms)
o Categories (898 ms)
o facet_tme_geographic_location_french (1249 ms)
o facet_tme_person_name_french (1940.75 ms)
o facet_tme_organization_name_french (1240.75 ms)

All combined, that gives me 6000 ms.

As for your other questions, such as how many unique values there are in the field, I don't know how to get this info.

*Jimmy M. Sélamy*

2012/11/11 Erick Erickson erickerick...@gmail.com
> You have to provide more details. How many unique values are there in the field in question? What's the query you're using? Are you sure other parts of the query aren't the culprit? What Solr version are you using?
>
> Please review: http://wiki.apache.org/solr/UsingMailingLists
>
> Best
> Erick
>
> On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:
>> I'm having performance issues with faceting on multivalued fields in an index of over 20 million documents. When faceting on a multivalued field, the QTime is unacceptable for my application because it can take up to 6000 ms. I've set facet.method to enum, which improved performance to the time I just mentioned, but it's still not acceptable. Are there any suggestions?
>>
>> Sent from BlackBerry on the Vidéotron mobile network
>>
>> From: Robert Muir rcm...@gmail.com
>> Date: Sat, 10 Nov 2012 21:33:47 -0500
>> To: dev@lucene.apache.org
>> ReplyTo: dev@lucene.apache.org
>> Subject: Optimize facets when actually single valued?
>>
>> I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I
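The per-field facet parameters in the query above follow a regular pattern (facet.field plus f.<field>.facet.limit and f.<field>.facet.mincount), so they can be generated rather than typed by hand. The sketch below only builds a query string with Python's standard library; `facet_params` is a hypothetical helper, and the two field names are taken from the message.

```python
from urllib.parse import urlencode

def facet_params(fields, limit=25, mincount=1, method=None):
    """Build (name, value) pairs for field faceting, one group per field."""
    params = [("facet", "true"), ("facet.sort", "count")]
    if method:
        params.append(("facet.method", method))
    for field in fields:
        params.append(("facet.field", field))
        params.append((f"f.{field}.facet.limit", str(limit)))
        params.append((f"f.{field}.facet.mincount", str(mincount)))
    return params

# A list of pairs (not a dict) keeps repeated keys such as facet.field.
query = [("q", "*:*"), ("rows", "0"), ("start", "0")]
query += facet_params(["feed_type", "facet_tme_iptc_category"], method="enum")
query_string = urlencode(query)
print(query_string)
```

Building the request one field at a time also makes it easy to time each facet separately, as was done for the measurements listed in the message.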
Re: Optimize facets when actually single valued?
You have to provide more details. How many unique values are there in the field in question? What's the query you're using? Are you sure other parts of the query aren't the culprit? What Solr version are you using?

Please review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:
> I'm having performance issues with faceting on multivalued fields in an index of over 20 million documents. When faceting on a multivalued field, the QTime is unacceptable for my application because it can take up to 6000 ms. I've set facet.method to enum, which improved performance to the time I just mentioned, but it's still not acceptable. Are there any suggestions?
>
> Sent from BlackBerry on the Vidéotron mobile network
>
> From: Robert Muir rcm...@gmail.com
> Date: Sat, 10 Nov 2012 21:33:47 -0500
> To: dev@lucene.apache.org
> ReplyTo: dev@lucene.apache.org
> Subject: Optimize facets when actually single valued?
>
> I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I have to think about it a bit, maybe it's even simpler than this? Anyway, this could be used instead of the actual schema def to just build a fieldcache instead of an uninverted field, I think... Should be a simple opto but maybe potent...
Re: Optimize facets when actually single valued?
On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy jym...@gmail.com wrote:
> I'm having performance issues with faceting on multivalued fields in an index of over 20 million documents. When faceting on a multivalued field, the QTime is unacceptable for my application because it can take up to 6000 ms. I've set facet.method to enum, which improved performance to the time I just mentioned, but it's still not acceptable. Are there any suggestions?

Yes: don't hijack my mailing list threads.
Re: Optimize facets when actually single valued?
On Sun, Nov 11, 2012 at 3:33 AM, Robert Muir rcm...@gmail.com wrote:
> I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I have to think about it a bit, maybe it's even simpler than this? Anyway, this could be used instead of the actual schema def to just build a fieldcache instead of an uninverted field, I think... Should be a simple opto but maybe potent...

Funny you should mention this now - I was thinking exactly the same thing on the flight home from ApacheCon!

This single-valued detection also has implications for things other than faceting - as you say, people can be lazy about the schema definition, and having things just work is a good thing.

I've thought about a more flexible field that acts like a single valued field when you use it like that, and a multi-valued field otherwise. There won't quite be back compat with responses though (since multiValued fields with single values now look like foo:[single_value] instead of foo:single_value). Perhaps we could add something like multiValued=flexible (and switch to that by default), while retaining back compat for multiValued=true/false. Either that or bump the version of the schema or response.

This is actually pretty important if we ever want to do more schema-less (i.e. type guessing based on input), since it allows us to only guess the type and not have to deal with figuring out multiValued. It could lower the number of dynamic field definitions necessary and make choosing the correct one simpler.

-Yonik
http://lucidworks.com
Optimize facets when actually single valued?
I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I have to think about it a bit, maybe it's even simpler than this? Anyway, this could be used instead of the actual schema def to just build a fieldcache instead of an uninverted field, I think... Should be a simple opto but maybe potent...
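To make the three statistics concrete, here is a toy model in plain Python. It is a sketch, not Lucene code: `field_stats` and `looks_single_valued` are hypothetical helpers, and token lists stand in for the indexed terms of one field. The message leaves open which statistics need to agree ("maybe it's even simpler than this?"); in this toy model the comparison that tracks "every document has exactly one distinct term" is docCount == sumDocFreq.

```python
from collections import Counter

def field_stats(docs):
    """Compute size, docCount and sumDocFreq for one field of a toy index.

    docs is a list of token lists, one per document; an empty list means
    the document has no value for the field.
    """
    doc_freq = Counter()   # term -> number of docs containing the term
    doc_count = 0          # docs with at least one term in the field
    for tokens in docs:
        if tokens:
            doc_count += 1
            doc_freq.update(set(tokens))
    return {
        "size": len(doc_freq),                  # number of distinct terms
        "docCount": doc_count,
        "sumDocFreq": sum(doc_freq.values()),
    }

def looks_single_valued(stats):
    # Each counted doc adds at least 1 to sumDocFreq, so equality means
    # every doc with the field holds exactly one distinct term.
    return stats["docCount"] == stats["sumDocFreq"]

single = [["news"], ["sports"], ["news"], []]   # one value per doc
multi = [["news", "sports"], ["news"]]          # first doc has two values

print(field_stats(single), looks_single_valued(field_stats(single)))
print(field_stats(multi), looks_single_valued(field_stats(multi)))
```

Note that in the `single` example the number of distinct terms is smaller than the other two statistics because the same value occurs in two documents, so requiring size to match as well would only hold for fields whose values are unique per document.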
Re: Optimize facets when actually single valued?
I'm having performance issues with faceting on multivalued fields in an index of over 20 million documents. When faceting on a multivalued field, the QTime is unacceptable for my application because it can take up to 6000 ms. I've set facet.method to enum, which improved performance to the time I just mentioned, but it's still not acceptable. Are there any suggestions?

Sent from BlackBerry on the Vidéotron mobile network

-Original Message-
From: Robert Muir rcm...@gmail.com
Date: Sat, 10 Nov 2012 21:33:47
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: Optimize facets when actually single valued?

I am guessing that at times people are lazy about schema definition. But I think with Lucene 4 stats we can detect if a field is actually single valued... Something like terms.size == terms.doccount == terms.sumdocfreq. I have to think about it a bit, maybe it's even simpler than this? Anyway, this could be used instead of the actual schema def to just build a fieldcache instead of an uninverted field, I think... Should be a simple opto but maybe potent...