Re: [MarkLogic Dev General] Higher relevance for newer documents?

Ken Tune Thu, 22 Aug 2013 04:42:27 -0700

Ron

There's a nice little graph of tanh here. Technically a hyperbolic function 
rather than a trig function FWIW.


http://www.roperld.com/science/MutualPositiveFeedback.htm

Tanh is a monotonically increasing function: if b > a then tanh(b) > 
tanh(a). With a minus sign in front the opposite holds. So the answer to 
"That is, the oldest item in the database would sort out last?" is yes, if 
you use -A * tanh(constant * (document age)).
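
To check the ordering numerically, here is a minimal Python sketch (Python 
rather than XQuery, purely for playing with the numbers offline). The names 
AMPLITUDE, HORIZON_DAYS and RATE are my own illustration; the rate is chosen 
so that tanh reaches 0.9 at a 30 year horizon, as in my earlier message 
quoted below.

```python
import math

AMPLITUDE = 2 ** 16            # assumed largest magnitude we allow for quality
HORIZON_DAYS = 30 * 365        # assumed "far field" document age, in days
# Solve tanh(RATE * HORIZON_DAYS) = 0.9 for RATE.
RATE = math.atanh(0.9) / HORIZON_DAYS


def tanh_quality(age_days: float) -> int:
    """Integer quality -A * tanh(rate * age): 0 when new, tending to -A with age."""
    return -round(AMPLITUDE * math.tanh(RATE * age_days))


ages = [0, 30, 365, 5 * 365, 30 * 365]
qualities = [tanh_quality(age) for age in ages]
for age, quality in zip(ages, qualities):
    print(f"{age:>6} days -> quality {quality}")
```

Every older document gets a strictly lower value here, so the oldest item 
sorts last.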

For what you want, any monotonic decreasing function will do the trick. A 
straight line with negative gradient is one (e.g. -alpha * (age of document)), 
as is exp(-t). The choice is up to you, and you might not care that much. 
However, quality has to be an integer, so whatever you pick must be scaled 
and rounded, and that limits how finely you can distinguish ages.
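
The integer limitation is easy to see with a small Python sketch. The 
function names, the scale values and the choice to measure t in years are 
assumptions of mine for illustration only.

```python
import math


def linear_quality(age_days: float, alpha: float) -> int:
    """Straight line with negative gradient, rounded to an integer."""
    return round(-alpha * age_days)


def exp_quality(age_days: float, scale: float) -> int:
    """scale * exp(-t), with t taken as age in years (an assumption), rounded."""
    return round(scale * math.exp(-age_days / 365))


# With a small scale, a new document and a 10 day old one round to the same
# integer, so the function cannot tell them apart.
coarse = (exp_quality(0, 10), exp_quality(10, 10))
# A larger scale restores the distinction.
fine = (exp_quality(0, 1000), exp_quality(10, 1000))
print("scale 10:  ", coarse)
print("scale 1000:", fine)
print("linear:    ", [linear_quality(a, 0.1) for a in (0, 365, 3650)])
```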

It occurred to me later that exponential decay is also possible via 
constant * exp(-lambda * (document age)). Choose 'constant' so it is large 
enough to give sufficient granularity once the value is rounded to an 
integer. Lambda is your decay rate. You could put this (or the above) 
formula into Excel to see what the fall-off is for different values of 
lambda. Set lambda = ln(2)/365 (document age in days), for example, and the 
relevance halves for every year you add to the document age.
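
If you would rather not use Excel, the same check can be done in a few lines 
of Python. This sketch assumes the decaying form 
constant * exp(-lambda * age), which is the form for which the halving 
holds; CONSTANT = 1024 is an arbitrary value chosen for the illustration.

```python
import math

HALF_LIFE_DAYS = 365
LAMBDA = math.log(2) / HALF_LIFE_DAYS   # decay rate giving a one year half-life
CONSTANT = 1024                         # assumed scale, for illustration only


def decay_quality(age_days: float) -> int:
    """Integer quality constant * exp(-lambda * age)."""
    return round(CONSTANT * math.exp(-LAMBDA * age_days))


yearly = [decay_quality(years * 365) for years in range(6)]
for years, quality in enumerate(yearly):
    print(f"{years} year(s) old -> quality {quality}")
```

The value halves with each added year: 1024, 512, 256, 128, 64, 32.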

You can 'tweak' your function, and if you need something like

Quality = 10 - xs:int(t div 365)    (0 < t <= 9 * 365)
        = 1                         (t > 9 * 365)

then I think that would be just fine too. It's still monotonic decreasing 
(almost: if b > a then f(b) <= f(a) rather than f(b) < f(a)).
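
A quick Python sketch of that stepped function, in case it helps to see the 
values. The function name is mine, and integer division stands in for the 
truncating xs:int cast, which is equivalent for non-negative ages.

```python
def stepped_quality(age_days: int) -> int:
    """10 for the first year, dropping by 1 per full year, floored at 1."""
    if age_days > 9 * 365:
        return 1
    return 10 - age_days // 365


samples = [1, 364, 365, 4 * 365, 9 * 365, 9 * 365 + 1, 100 * 365]
for age in samples:
    print(f"{age:>6} days -> quality {stepped_quality(age)}")
```

Documents in the same year bucket tie, which is the "<=" rather than "<" 
case.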

Or you could do

Quality = constant * exp(-lambda * t)           (0 < t <= 10 * 365)
        = constant * exp(-lambda * 10 * 365)    (t > 10 * 365)

where the second value is a constant, which would mean that relevance 
doesn't drop to zero.
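
As a sketch of the capped version in Python, assuming the decaying form 
constant * exp(-lambda * t) with the age clamped at ten years. The names and 
the values 2^16 and ln(2)/365 are assumptions for the example.

```python
import math

CAP_DAYS = 10 * 365              # ages beyond this are treated as exactly this
CONSTANT = 2 ** 16               # assumed scale
LAMBDA = math.log(2) / 365       # assumed one year half-life


def capped_quality(age_days: float) -> int:
    """Exponential decay that stops falling once the document is 10 years old."""
    effective_age = min(age_days, CAP_DAYS)
    return round(CONSTANT * math.exp(-LAMBDA * effective_age))


for years in (0, 1, 5, 10, 20, 100):
    print(f"{years:>3} years -> quality {capped_quality(years * 365)}")
```

With these numbers everything older than ten years sits at the same floor 
of 64, so a 100 year old article is ranked no worse than a 10 year old one.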

Ken




-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Ron Hitchens
Sent: 22 August 2013 12:17
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?


Ken,

   Thanks for doing the math(s) on this - literally.  I must admit I don't 
remember much trigonometry.  Would the function you describe rank the content 
so that any content item that's older than another would always have a lower 
quality?  That is, the oldest item in the database would sort out last?

   I guess this means that a blank search (no word or value queries) would 
yield a result sequence that's effectively sorted by date in descending order?  
That's actually pretty handy.

   The only issue here is that a given client, STM publisher for example, may 
not want the old stuff to be increasingly less relevant.  They may want the 
recent stuff to come to the top, but anything older than, say, five years to 
just be of normal relevance.  If the freshness curve is calculated on 
publication date, then the really old articles (100+ years in some cases) would 
rarely come up.

   I suppose if you carefully chose the numbers you could put a lower cap on 
quality so that anything farther back than 10 or 20 years would be set to 
neutral quality.  Over the reasonable lifetime of such a system this scheme 
would probably work fine.

   Thanks.

---
Ron Hitchens {[email protected]}  +44 7879 358212


On Aug 21, 2013, at 9:53 AM, Ken Tune <[email protected]> wrote:

> Use of a minus sign and some rounding may help. Remember we can have negative 
> quality.
> 
> Let's take Mike's original suggestion and write
> 
> - ( minus ) xs:int(2 ^ ((current date - chosen fixed date) div constant))
> 
> Choose chosen fixed date so it is the least recent date in the database.
> 
> Decide how big you want the above number to get - let's say no bigger 
> than 2 ^ 16
> 
> Choose a far field date - say today + 30 years
> 
> 16 = (current date + 30 years - chosen fixed date ) div constant allowing you 
> to find what your constant value should be.
> 
> Courtesy of the minus sign, the bigger the above number gets, the less 
> relevant the document is.
> 
> A slightly more elegant choice would be a function which 
> asymptotically approaches a constant value, e.g. 
> tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), which has values 
> from -1 to +1.
> 
> Decide how 'big' this number can get e.g. 2^16, and choose a scaling 
> factor ( say lambda ) so that your value of
> 
> - 2^16 * tanh(lambda * (current date - fixed date)) varies sufficiently. 
> 
> So for 30 year + content let's say we want this in the region where 
> tanh(x) is 0.9 or more:
> 
> tanh(lambda * ( current date + 30 years - fixed date )) = 0.9. 
> 
> Some basic manipulation gets you lambda.
> 
> Wrap xs:integer round the above, as quality has to be an integer value.
> 
> Ken
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Ron 
> Hitchens
> Sent: 20 August 2013 21:01
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?
> 
> 
>   Because your quality value would have to be exponentially increasing over 
> time.  As you move up the increasingly vertical curve, you'll soon shoot past 
> the magnitude of a 32 bit number.
> 
>   What you really want is for the curve to fall quickly into the past from 
> now, then level off the further back you go.  You'd want that curve to be 
> computed relative to the query time, not the ingestion time.
> 
>   You could do the exponential thing if you constantly crawl the content and 
> re-adjust the quality values.  But not if you stick a constant number on the 
> document that doesn't change over its lifetime.
> 
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>     +44 7879 358 212 (voice)          http://www.ronsoft.com
>     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 
> On Aug 20, 2013, at 8:48 PM, David Gorbet <[email protected]> wrote:
> 
>> Why couldn't you do an exponential decay? You control the formula, right? It 
>> could be (weeks-since-1970)^2, couldn't it?
>> 
>> Sent from my Windows Phone
>> From: Ron Hitchens
>> Sent: 8/20/2013 12:46 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?
>> 
>> 
>>   Thanks Mike.  I'd looked at a similar idea involving the copyright 
>> year (that was too coarse).  The number of weeks since some distant 
>> date is a pretty good idea.
>> 
>>   I suppose the biggest weakness of this algorithm is that it is 
>> necessarily linear.  You can't do an exponential decay where the 
>> quality of recent documents drops off quickly and then levels off as 
>> they get older.  Though linear is better than nothing.
>> 
>>   Is there any downside to constantly increasing quality values?
>> The quality argument of xdmp:set-document-quality is a 32-bit xs:int.
>> Is the relevance boost from quality applied evenly across the range 
>> of possible values?
>> 
>>   Thanks.
>> 
>> ---
>> Ron Hitchens {[email protected]}  +44 7879 358212
>> 
>> 
>> On Aug 20, 2013, at 7:20 PM, Michael Blakeley <[email protected]> wrote:
>> 
>>> What about using a naturally-increasing number for quality?
>>> 
>>> For example the number of weeks since 1970:
>>> 
>>>   xs:integer(
>>>     (current-date() - xs:date('1970-01-01'))
>>>      div xs:dayTimeDuration("P7D"))
>>>   => 531
>>> 
>>> You can reduce the magnitude of the quality boost by increasing the bucket 
>>> size: 14D, 30D, etc. Or changing the start-date might also be useful.
>>> 
>>> No crawl is necessary, unless you change your mind about the boost 
>>> algorithm.
>>> 
>>> -- Mike
>>> 
>>> On 20 Aug 2013, at 11:10 , Ron Hitchens <[email protected]> wrote:
>>> 
>>>> 
>>>> What are the techniques out there for giving newer documents higher 
>>>> relevance?  My target is MarkLogic 5.x, but 6.x may be in play 
>>>> before long.
>>>> 
>>>> There are two schemes that I am aware of, neither of which feels 
>>>> very elegant:
>>>> 
>>>> 1) Give documents a high quality value when ingested.  Periodically 
>>>> crawl the content and for any document with positive quality, 
>>>> reduce its quality according to some algorithm until the quality reaches 
>>>> zero.
>>>> 
>>>> This gives the best control over "freshness", but has the 
>>>> disadvantage of causing potentially large numbers of updates on 
>>>> each pass with the attendant merges and disk I/O & CPU load.
>>>> 
>>>> 2) Replicate the "real" query n times, each and-ed with a 
>>>> time-based query against the insertion date.  All of these are 
>>>> or-ed together with descending weights for older dates.
>>>> 
>>>> This doesn't require changing documents to tweak their freshness.  
>>>> But it also means you have a stair-step function of n-steps, which 
>>>> may not be very precise - and which wouldn't scale very well for 
>>>> large values of n.  And unfortunately, since the queries would be 
>>>> time-based, you can't pre-register them ahead of time.
>>>> 
>>>> Any other clever techniques that you've used?
>>>> 
>>>> ---
>>>> Ron Hitchens {[email protected]}  +44 7879 358212
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>> 
>>> 
