Elias, I'm assuming you have one composite field that looks like so:
<guid>_<sha>_<time>

where <time> is ISO 8601 basic format down to hour resolution. If this
is the case then you _cannot_ perform an efficient or correct (without
extra work in your application) query on the <time> portion of the
field. To use that field for a time range, Search would have to perform
a full scan of the compound field, and then you would have to
filter/sort on <time>. This is because <time> is independent of
<guid>_<sha> and can repeat itself. If you have queries that are
strictly time based, then a range query on the time field would be most
appropriate.

In general, if you make a compound field <f1>_<f2>, and <f1> and <f2>
are independent of each other (e.g. guid and time), then you can only
use a range query on <f1>. You could range on <f1>_<f2>, but you'll
include <f2> values that are below/above your <f2> range because they
are still inside the <f1> range. Depending on how large a range <f1>
matched, this may not be an issue and you could filter in your
application. You can also make use of inline fields, which effectively
allow Search to do the filtering for you at the index level.

In your case you have to ask two questions.

1. What is the uniqueness/selectivity/cardinality of each field?

2. How do you plan to query the data?

<guid> - Is the least selective. Any given value has high cardinality:
there may be many events for a given user over the life of the
application (or that's my assumption anyways). Using this field alone is
a bad idea, but it is a good candidate as an inline field.

<sha> - Is more selective because, as you stated, it can be rare. It
will have outliers, which make it less selective in some cases. Using
this field alone could prove costly, but add an inline <guid> filter and
it is probably efficient most of the time.

<time> - This is most selective _on average_, assuming a fairly normal
distribution of events over the course of time. This shares a similar
trait with <sha> in that you could have hours of the day that are
bursty.
This field alone should be fine most of the time and improves if used
with inline <sha> or <guid>.

From what I can understand you have two main queries:

a. All events for range <time1> to <time2> - In that case I think a
range on the time field would do just fine. If you want to further
refine based on sha/guid, then add them as inline fields and use a
filter, or even do it in your application.

b. All events for a large time period (or for the life of the
application) for a given sha/guid combination - In this case, since
<sha> is more selective, you'll want to query via <sha> but use <time>
and <guid> as inline filters. You _don't_ want to query by <guid>
because it's not selective enough and will cause Search to perform a
large scan.

I hope this makes some sense. I realize I've completely ignored your
range vs. prefix query question, but that's because I don't think
that's the real issue here.

-Ryan

On Fri, Jan 6, 2012 at 8:05 PM, Elias Levy <[email protected]> wrote:

> I was wondering whether some of the Basho folk may have some wise
> words about when to choose between ranges and wildcards when using
> Riak Search.
>
> I've noticed in our systems that using wildcards will often give more
> performant results than using ranges, at least for our data and some
> of the indexes we've defined on them. In some instances, the Solr
> search API will timeout when using a range query, but return results
> when using a wildcard.
>
> These indexes are pseudo-compound indexes. We've taken three values
> from our data, and concatenated them into a single field for indexing
> purposes, so as to increase cardinality of the field. The first value
> may be a customer guid, the second a sha, which can be fairly rare or
> very common, and the last a timestamp bucketed to either the hour.
> We usually use ranges to limit the time under analysis, but at some
> times we want large time ranges, possibly all the data for any time
> for the other values, and in those cases ranges seem to underperform
> a wildcard.
>
> Elias Levy
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
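
To make the compound-field point in the reply concrete, here is a
minimal Python sketch. It is not Riak Search code: it only assumes that
a term index behaves like a sorted list of terms, so that a range query
is a contiguous slice of that list. The sample events, the range_scan
helper, and all variable names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical events: (guid, sha, hour), hour in ISO 8601 basic format.
events = [
    ("g1", "s1", "20120105T10"),
    ("g1", "s2", "20120106T09"),
    ("g2", "s1", "20120101T00"),
    ("g2", "s1", "20120106T12"),
    ("g3", "s9", "20120105T23"),
]

# Assumption: the index stores terms in lexicographic order.
compound_terms = sorted("_".join(e) for e in events)
time_terms = sorted(e[2] for e in events)


def range_scan(sorted_terms, low, high):
    """Return every term t with low <= t <= high (lexicographic order)."""
    start = bisect_left(sorted_terms, low)
    stop = bisect_right(sorted_terms, high)
    return sorted_terms[start:stop]


t1, t2 = "20120105T00", "20120106T10"

# Range on a dedicated time field: every hit is inside the time window.
by_time = range_scan(time_terms, t1, t2)

# Range on the compound field: the slice is bounded by the guid prefix,
# so times outside [t1, t2] can still fall inside it.
compound_hits = range_scan(compound_terms, "g1_s1_" + t1, "g2_s1_" + t2)
leaked = [t for t in compound_hits if not (t1 <= t.rsplit("_", 1)[1] <= t2)]

print("time field hits:    ", by_time)
print("compound field hits:", compound_hits)
print("outside time window:", leaked)
```

With this sample data the compound range returns three terms, and one
of them (g2_s1_20120101T00) lies outside the requested time window, so
it would still have to be filtered out in the application or by an
inline-field filter.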
