Re: [MarkLogic Dev General] Higher relevance for newer documents?

Ken Tune Wed, 21 Aug 2013 01:53:27 -0700

Use of a minus sign and some rounding may help. Remember we can have negative 
quality.


Let's take Mike's original suggestion and write

- ( minus ) xs:int( 2 ^ ( (current date - chosen fixed date) div constant ) )

Choose the fixed date so that it is the earliest date in the database.

Decide how big you want the above number to get - let's say no bigger than 
2 ^ 16.

Choose a far-future date - say today + 30 years.

Then solve

16 = (current date + 30 years - chosen fixed date) div constant

to find what your constant value should be.

Courtesy of the minus sign, the bigger the above number gets, the less relevant 
the document is.
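
To make the arithmetic above concrete, here is a minimal Python sketch. It is only an illustration: the fixed date (1970-01-01), the far date (21 Aug 2013 + 30 years), and the names FIXED_DATE, FAR_DATE, CONSTANT_DAYS and power_quality are all assumptions, not values from this thread, and "current date" is read as the date stamped on the document at ingestion.

```python
from datetime import date

# Illustrative assumptions, not values taken from the thread.
FIXED_DATE = date(1970, 1, 1)    # assumed earliest date in the database
FAR_DATE = date(2043, 8, 21)     # assumed far date: 21 Aug 2013 + 30 years
MAX_EXPONENT = 16                # keep the magnitude at or below 2 ^ 16

# Solve 16 = (far date - fixed date) div constant for the constant (in days).
CONSTANT_DAYS = (FAR_DATE - FIXED_DATE).days / MAX_EXPONENT


def power_quality(doc_date: date) -> int:
    """Return -int(2 ^ ((doc_date - FIXED_DATE) div constant))."""
    exponent = (doc_date - FIXED_DATE).days / CONSTANT_DAYS
    return -int(2 ** exponent)


if __name__ == "__main__":
    for d in (FIXED_DATE, date(2013, 8, 21), FAR_DATE):
        print(d, power_quality(d))
```

With these assumed dates the value is -1 at the fixed date and -65536 at the far date, so it stays well inside a 32-bit xs:int. As written, the value becomes more negative for later dates.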

A slightly more elegant choice would be a function which asymptotically 
approaches a constant value, e.g. 
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), which has values from -1 
to +1.

Decide how 'big' this number can get, e.g. 2^16, and choose a scaling factor 
(say lambda) so that your value of

- 2^16 * tanh(lambda * (current date - fixed date))

varies sufficiently.

So for content 30 or more years out, let's say we want to be in the region 
where tanh(x) is 0.9 or more:

tanh(lambda * (current date + 30 years - fixed date)) = 0.9

Some basic manipulation (take the inverse tanh of both sides) gets you lambda.

Wrap xs:integer round the above, as quality has to be an integer value.
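
The tanh variant can be sketched the same way. Again this is a rough illustration under assumed values: the dates and the names LAMBDA_PER_DAY and tanh_quality are hypothetical, and lambda is obtained from atanh(0.9) = 0.5 * ln(19), roughly 1.472, divided by the number of days between the fixed date and the far date.

```python
import math
from datetime import date

# Illustrative assumptions, not values taken from the thread.
FIXED_DATE = date(1970, 1, 1)    # assumed earliest date in the database
FAR_DATE = date(2043, 8, 21)     # assumed far date: 21 Aug 2013 + 30 years
SCALE = 2 ** 16                  # how 'big' the value may get
TARGET = 0.9                     # desired tanh value at the far date

# tanh(lambda * span) = 0.9  =>  lambda = atanh(0.9) / span
LAMBDA_PER_DAY = math.atanh(TARGET) / (FAR_DATE - FIXED_DATE).days


def tanh_quality(doc_date: date) -> int:
    """Return -(2^16 * tanh(lambda * days since FIXED_DATE)), rounded to an integer."""
    days = (doc_date - FIXED_DATE).days
    return -int(round(SCALE * math.tanh(LAMBDA_PER_DAY * days)))


if __name__ == "__main__":
    for d in (FIXED_DATE, date(2013, 8, 21), FAR_DATE):
        print(d, tanh_quality(d))
```

Under these assumptions the value is 0 at the fixed date and -58982 (0.9 * 2^16, rounded) at the far date, and it can never go below -65536 however far out the date is.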

Ken

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Ron Hitchens
Sent: 20 August 2013 21:01
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?


   Because your quality value would have to be exponentially increasing over 
time.  As you move up the increasingly vertical curve, you'll soon shoot past 
the magnitude of a 32 bit number.

   What you really want is for the curve to fall quickly into the past from 
now, then level off the further back you go.  You'd want that curve to be 
computed relative to the query time, not the ingestion time.

   You could do the exponential thing if you constantly crawl the content and 
re-adjust the quality values.  But not if you stick a constant number on the 
document that doesn't change over its lifetime.
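
As a rough illustration of the curve described above, the Python sketch below computes a freshness factor from the document's age at query time. The half-life of 90 days and the names HALF_LIFE_DAYS and freshness are hypothetical, and the sketch says nothing about how such a factor would be fed into a MarkLogic query.

```python
from datetime import date, timedelta

HALF_LIFE_DAYS = 90.0  # hypothetical: freshness halves every 90 days


def freshness(doc_date: date, query_date: date) -> float:
    """Exponential decay in document age, measured at query time."""
    age_days = max((query_date - doc_date).days, 0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


if __name__ == "__main__":
    query_day = date(2013, 8, 21)
    for age in (0, 90, 180, 3650):
        print(age, freshness(query_day - timedelta(days=age), query_day))
```

The factor is 1.0 for a document dated on the query day, 0.5 at 90 days and 0.25 at 180 days, then flattens towards zero. The same document scores lower when queried later, which is exactly what a constant stored on the document cannot express.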

---
Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
     +44 7879 358 212 (voice)          http://www.ronsoft.com
     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown

On Aug 20, 2013, at 8:48 PM, David Gorbet <[email protected]> wrote:

> Why couldn't you do an exponential decay? You control the formula, right? It 
> could be (weeks-since-1970)^2, couldn't it?
> 
> Sent from my Windows Phone
> From: Ron Hitchens
> Sent: 8/20/2013 12:46 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?
> 
> 
>    Thanks Mike.  I'd looked at a similar idea involving the copyright 
> year (that was too coarse).  The number of weeks since some distant 
> date is a pretty good idea.
> 
>    I suppose the biggest weakness of this algorithm is that it is 
> necessarily linear.  You can't do an exponential decay where the 
> quality of recent documents drops off quickly and then levels off as 
> they get older.  Though linear is better than nothing.
> 
>    Is there any downside to constantly increasing quality values?
> The quality argument of xdmp:set-document-quality is a 32-bit xs:int.
> Is the relevance boost from quality applied evenly across the range of 
> possible values?
> 
>    Thanks.
> 
> ---
> Ron Hitchens {[email protected]}  +44 7879 358212
> 
> 
> On Aug 20, 2013, at 7:20 PM, Michael Blakeley <[email protected]> wrote:
> 
> > What about using a naturally-increasing number for quality?
> > 
> > For example the number of weeks since 1970:
> > 
> >    xs:integer(
> >      (current-date() - xs:date('1970-01-01'))
> >       div xs:dayTimeDuration("P7D"))
> >    => 531
> > 
> > You can reduce the magnitude of the quality boost by increasing the bucket 
> > size: 14D, 30D, etc. Or changing the start-date might also be useful.
> > 
> > No crawl is necessary, unless you change your mind about the boost 
> > algorithm.
> > 
> > -- Mike
> > 
> > On 20 Aug 2013, at 11:10 , Ron Hitchens <[email protected]> wrote:
> > 
> >> 
> >>  What are the techniques out there for giving newer documents 
> >> higher relevance?  My target is MarkLogic 5.x, but 6.x may be in 
> >> play before long.
> >> 
> >>  There are two schemes that I am aware of, neither of which feels 
> >> very elegant:
> >> 
> >> 1) Give documents a high quality value when ingested.  Periodically 
> >> crawl the content and for any document with positive quality, 
> >> reduce its quality according to some algorithm until the quality reaches 
> >> zero.
> >> 
> >>  This gives the best control over "freshness", but has the 
> >> disadvantage of causing potentially large numbers of updates on 
> >> each pass with the attendant merges and disk I/O & CPU load.
> >> 
> >> 2) Replicate the "real" query n times, each and-ed with a 
> >> time-based query against the insertion date.  All of these are 
> >> or-ed together with descending weights for older dates.
> >> 
> > >>  This doesn't require changing documents to tweak their freshness.  
> >> But it also means you have a stair-step function of n-steps, which 
> >> may not be very precise - and which wouldn't scale very well for 
> >> large values of n.  And unfortunately, since the queries would be 
> >> time-based, you can't pre-register them ahead of time.
> >> 
> >>  Any other clever techniques that you've used?
> >> 
> >> ---
> >> Ron Hitchens {[email protected]}  +44 7879 358212
> >> 
> >> 
> >> 
> >> 
> >> _______________________________________________
> >> General mailing list
> >> [email protected]
> >> http://developer.marklogic.com/mailman/listinfo/general
> >> 
