Elias,

I'm assuming you have one composite field that looks like so:

<guid>_<sha>_<time>

where <time> is ISO8601 basic format down to an hour resolution.

If this is the case then you _cannot_ run an efficient query on the <time>
portion of the field, nor a correct one without doing extra work in your
application.  To use that field for a time range, Search would have to
perform a full scan of the compound field and then you would have to
filter/sort on <time>.  This is because <time> is independent of
<guid>_<sha> and can repeat itself.  If you have queries that are strictly
time based then a range query on a dedicated time field would be most
appropriate.
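
As a rough illustration, here is a minimal Python sketch.  It is not Riak
Search code; the sample terms and function names are made up, and a sorted
list stands in for the term index.  It shows why a time range over the
composite terms has to visit every term, while a dedicated time field can
be answered with two binary searches.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite terms of the form <guid>_<sha>_<time>.
composite = sorted([
    "g1_aaa_20120106T10",
    "g1_bbb_20120106T12",
    "g2_aaa_20120106T11",
    "g2_ccc_20120106T09",
])

def time_range_on_composite(terms, lo, hi):
    """Full scan: <time> is the last component, so sort order does not help."""
    return [t for t in terms if lo <= t.rsplit("_", 1)[1] <= hi]

# The same events indexed on a separate time field, kept sorted by time.
time_pairs = sorted((t.rsplit("_", 1)[1], t) for t in composite)
time_keys = [k for k, _ in time_pairs]
time_docs = [d for _, d in time_pairs]

def time_range_on_time_field(lo, hi):
    """Two binary searches bound the slice; only matching entries are read."""
    start = bisect_left(time_keys, lo)
    end = bisect_right(time_keys, hi)
    return time_docs[start:end]
```

Both functions return the same events; the difference is how many terms
each one has to look at to get there.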

In general, if you make a compound field <f1>_<f2>, and <f1> and <f2> are
independent of each other (e.g. guid and time) then you can only use a
range query on <f1>.  You could range on <f1>_<f2> but you'll include <f2>
values that are below/above your <f2> range because they are still inside
the <f1> range.  Depending on how large a range <f1> matched, this may
not be an issue and you could filter in your application.  You can also
make use of inline fields, which effectively allow Search to do the
filtering for you at the index level.
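
A small Python sketch of that effect, assuming terms are compared
lexicographically and using made-up sample values (again a sorted list
stands in for the index, not Riak Search itself):

```python
from bisect import bisect_left, bisect_right

# Hypothetical <f1>_<f2> terms: f1 is a guid, f2 is an hour bucket.
terms = sorted([
    "g1_2012010609",
    "g1_2012010612",
    "g2_2012010608",
    "g2_2012010610",
    "g3_2012010610",
])

def compound_range(lo_term, hi_term):
    """Lexicographic range over the sorted compound terms."""
    return terms[bisect_left(terms, lo_term):bisect_right(terms, hi_term)]

# Intent: f1 in [g1, g2] and f2 in [2012010610, 2012010611].
hits = compound_range("g1_2012010610", "g2_2012010611")

# The range also returns f2 values outside the intended window, so a
# second pass on <f2> is needed (in the application or via an inline field).
wanted = [t for t in hits if "2012010610" <= t.split("_")[1] <= "2012010611"]
```

Here `hits` contains three terms but only one of them survives the <f2>
filter; the other two are inside the <f1> range with an <f2> above or below
the window.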

In your case you have to ask two questions.

1. What is the uniqueness/selectivity/cardinality of each field?

2. How do you plan to query the data?

<guid> - Is the least selective.  Any given value matches many entries,
since there may be many events for a given user over the life of the
application (or that's my assumption anyways).  Using this field alone is a
bad idea, but it is a good candidate as an inline field.

<sha> - Is more selective because, as you stated, it can be rare.  It will
have outliers, which make it less selective in some cases.  Using this
field alone could prove costly, but add an inline <guid> filter and it is
probably efficient most of the time.

<time> - This is most selective _on average_ assuming a fairly normal
distribution of events over the course of time.  This shares a similar
trait with <sha> in that you could have hours of the day that are bursty.
This field alone should be fine most of the time and improves if used with
inline <sha> or <guid>.
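
One way to answer question 1 for your own data is to count how many events
each distinct value matches.  A minimal Python sketch, assuming you can
export events as (guid, sha, hour) tuples; the sample data and the
`matches_per_value` helper are hypothetical:

```python
from collections import Counter

# Hypothetical events as (guid, sha, hour) tuples.
events = [
    ("g1", "aaa", "2012010610"),
    ("g1", "aaa", "2012010611"),
    ("g1", "bbb", "2012010611"),
    ("g1", "ccc", "2012010612"),
    ("g2", "aaa", "2012010612"),
    ("g2", "ddd", "2012010613"),
]

def matches_per_value(events, position):
    """Return (average, worst-case) number of events matched by one value."""
    counts = Counter(e[position] for e in events)
    return len(events) / len(counts), max(counts.values())

for name, pos in (("guid", 0), ("sha", 1), ("time", 2)):
    avg, worst = matches_per_value(events, pos)
    print(f"{name}: avg {avg:.1f} events per value, worst case {worst}")
```

The average tells you how selective a field is in general; the worst case
shows the outliers (a very common sha, a bursty hour) that make a field
expensive to query on its own.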

From what I can understand you have two main queries:

a. All events for range <time1> to <time2> - in that case I think a range
on the time field would do just fine.  If you want to further refine based
on sha/guid then add them as inline fields and use a filter or even do it
in your application.

b. All events for a large time period (or for the life of the application)
for a given sha/guid combination - in this case, since <sha> is more
selective, you'll want to query via <sha> but use <time> and <guid> as
inline filters.  You _don't_ want to query by <guid> because it's not
selective enough and will cause Search to perform a large scan.
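
To show the shape of query (b), here is a toy Python model.  It is only an
assumption about how to picture it, not how Search is implemented: postings
are keyed by <sha>, and each posting carries <guid> and <time> alongside
the document id the way inline fields would.  All names are hypothetical.

```python
from collections import defaultdict

# Toy index: sha -> list of postings with inline guid and time values.
postings = defaultdict(list)

def index_event(doc_id, guid, sha, hour):
    postings[sha].append({"id": doc_id, "guid": guid, "time": hour})

def events_for(sha, guid, lo=None, hi=None):
    """Look up by the selective field, then filter on the inline values."""
    out = []
    for p in postings.get(sha, []):
        if p["guid"] != guid:
            continue
        if lo is not None and p["time"] < lo:
            continue
        if hi is not None and p["time"] > hi:
            continue
        out.append(p["id"])
    return out

index_event("e1", "g1", "aaa", "2012010610")
index_event("e2", "g2", "aaa", "2012010611")
index_event("e3", "g1", "aaa", "2012010712")
index_event("e4", "g1", "bbb", "2012010610")
```

For example, `events_for("aaa", "g1")` covers the life of the application,
and passing `lo`/`hi` narrows it to a time window.  Only the postings for
one sha are ever touched, which is the point of leading with the more
selective field.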

I hope this makes some sense.  I realize I've completely ignored your range
vs. prefix query question but that's because I don't think that's the real
issue here.

-Ryan


On Fri, Jan 6, 2012 at 8:05 PM, Elias Levy <[email protected]> wrote:

> I was wondering whether some of the Basho folk may have some wise words
> about when to choose between ranges and wildcards when using Riak Search.
>
> I've noticed in our systems that using wildcards will often give more
> performant results than using ranges, at least for our data and some of the
> indexes we've defined on them.  In some instances, the Solr search API will
> timeout when using a range query, but return results when using a wildcard.
>
>
> These indexes are pseudo-compound indexes.  We've taken three values from
> our data, and concatenated them into a single field for indexing purposes,
> so as to increase cardinality of the field.  The first value may be a
> customer guid, the second a sha, which can be fairly rare or very common,
> and the last a timestamp bucketed to the hour.  We usually use ranges to
> limit the time under analysis, but at some times we want large time
> ranges, possibly all the data for any time for the other values, and in
> those cases ranges seem to underperform a wildcard.
>
> Elias Levy
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>