Ken,

   Thanks for doing the math(s) on this - literally.  I must admit
I don't remember much trigonometry.  Would the function you describe
rank the content so that any content item that's older than another
would always have a lower quality?  That is, the oldest item in the database
would sort out last?

   I guess this means that a blank search (no word or value queries) would
yield a result sequence that's effectively sorted by date in descending
order?  That's actually pretty handy.

   The only issue here is that a given client, STM publisher for example,
may not want the old stuff to be increasingly less relevant.  They may want
the recent stuff to come to the top, but anything older than, say, five years
to just be of normal relevance.  If the freshness curve is calculated on
publication date, then the really old articles (100+ years in some cases)
would rarely come up.
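
   To make that concrete, here's a rough sanity check of Ken's tanh scheme
(Python rather than XQuery, purely to see the numbers; the 2^16 cap and the
30-year far field are his figures, the rest follows from them):

```python
import math

MAX_BOOST = 2 ** 16          # Ken's cap on the magnitude of the quality value
FAR_FIELD_DAYS = 30 * 365    # his "far field" of roughly 30 years, in days

# Choose lambda so that tanh(lambda * 30 years) = 0.9, per his derivation.
lam = math.atanh(0.9) / FAR_FIELD_DAYS

def quality(age_days: int) -> int:
    """Negated, scaled tanh of document age: newer documents score higher."""
    return -math.floor(MAX_BOOST * math.tanh(lam * age_days))

for years in (1, 5, 30, 100):
    print(years, quality(years * 365))
```

A one-year-old article already gives up around 5% of the maximum boost, and a
century-old one sits essentially at the -2^16 floor, which is exactly the
"really old articles would rarely come up" effect I'm worried about.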

   I suppose if you carefully chose the numbers you could put a lower cap
on quality so that anything farther back than 10 or 20 years would be set
to neutral quality.  Over the reasonable lifetime of such a system this
scheme would probably work fine.
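
   A minimal sketch of that capped scheme (Python again just for illustration;
the 2003 cutoff is an assumed ten-year window, and in practice the value would
be computed in XQuery and stored with xdmp:document-set-quality):

```python
from datetime import date

CUTOFF = date(2003, 8, 20)  # assumed cutoff: anything older is "neutral"

def quality(pub: date) -> int:
    """Static per-document quality: weeks since the cutoff, floored at zero.

    Documents published before CUTOFF all get neutral quality (0); newer
    ones earn one point per week of freshness, so within the window the
    ranking is strictly newer-is-better. The value depends only on the
    publication date, so it is set once at ingest with no periodic
    re-crawl.
    """
    return max(0, (pub - CUTOFF).days // 7)

print(quality(date(2013, 8, 20)))  # ten years of boost
print(quality(date(1913, 1, 1)))   # century-old article: neutral, 0
```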

   Thanks.

---
Ron Hitchens {[email protected]}  +44 7879 358212


On Aug 21, 2013, at 9:53 AM, Ken Tune <[email protected]> wrote:

> Use of a minus sign and some rounding may help. Remember we can have negative 
> quality.
> 
> Let's take Mike's original suggestion and write
> 
> - ( minus ) xs:int( 2 ^ ( (current date - chosen fixed date) div constant ) )
> 
> Choose chosen fixed date so it is the least recent date in the database.
> 
> Decide how big you want the above number to get - let's say no bigger than 2 
> ^ 16
> 
> Choose a far field date - say today + 30 years
> 
> Then 16 = (current date + 30 years - chosen fixed date) div constant, allowing 
> you to find what your constant value should be.
> 
> Courtesy of the minus sign, the bigger the above number gets, the less 
> relevant the document is.
> 
> A slightly more elegant choice would be a function which asymptotically 
> approaches a constant value, e.g. tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), 
> which has values from -1 to +1. 
> 
> Decide how 'big' this number can get e.g. 2^16, and choose a scaling factor ( 
> say lambda ) so that your value of
> 
> - 2^16 * tanh(lambda * (current date - fixed date)) varies sufficiently. 
> 
> So for 30-year-plus content, let's say we want this in the region where 
> tanh(x) is 0.9 or more: 
> 
> tanh(lambda * ( current date + 30 years - fixed date )) = 0.9. 
> 
> Some basic manipulation gets you lambda.
> 
> Wrap xs:integer round the above, as quality has to be an integer value.
> 
> Ken
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Ron Hitchens
> Sent: 20 August 2013 21:01
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?
> 
> 
>   Because your quality value would have to be exponentially increasing over 
> time.  As you move up the increasingly vertical curve, you'll soon shoot past 
> the magnitude of a 32 bit number.
> 
>   What you really want is for the curve to fall quickly into the past from 
> now, then level off the further back you go.  You'd want that curve to be 
> computed relative to the query time, not the ingestion time.
> 
>   You could do the exponential thing if you constantly crawl the content and 
> re-adjust the quality values.  But not if you stick a constant number on the 
> document that doesn't change over its lifetime.
> 
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>     +44 7879 358 212 (voice)          http://www.ronsoft.com
>     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 
> On Aug 20, 2013, at 8:48 PM, David Gorbet <[email protected]> wrote:
> 
>> Why couldn't you do an exponential decay? You control the formula, right? It 
>> could be (weeks-since-1970)^2, couldn't it?
>> 
>> Sent from my Windows Phone
>> From: Ron Hitchens
>> Sent: 8/20/2013 12:46 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Higher relevance for newer documents?
>> 
>> 
>>   Thanks Mike.  I'd looked at a similar idea involving the copyright 
>> year (that was too coarse).  The number of weeks since some distant 
>> date is a pretty good idea.
>> 
>>   I suppose the biggest weakness of this algorithm is that it is 
>> necessarily linear.  You can't do an exponential decay where the 
>> quality of recent documents drops off quickly and then levels off as 
>> they get older.  Though linear is better than nothing.
>> 
>>   Is there any downside to constantly increasing quality values?
>> The quality argument of xdmp:document-set-quality is a 32-bit xs:int.
>> Is the relevance boost from quality applied evenly across the range of 
>> possible values?
>> 
>>   Thanks.
>> 
>> ---
>> Ron Hitchens {[email protected]}  +44 7879 358212
>> 
>> 
>> On Aug 20, 2013, at 7:20 PM, Michael Blakeley <[email protected]> wrote:
>> 
>>> What about using a naturally-increasing number for quality?
>>> 
>>> For example the number of weeks since 1970:
>>> 
>>>   xs:integer(
>>>     (current-date() - xs:date('1970-01-01'))
>>>      div xs:dayTimeDuration("P7D"))
>>>   => 2276
>>> 
>>> You can reduce the magnitude of the quality boost by increasing the bucket 
>>> size: 14D, 30D, etc. Or changing the start-date might also be useful.
>>> 
>>> No crawl is necessary, unless you change your mind about the boost 
>>> algorithm.
>>> 
>>> -- Mike
>>> 
>>> On 20 Aug 2013, at 11:10 , Ron Hitchens <[email protected]> wrote:
>>> 
>>>> 
>>>> What are the techniques out there for giving newer documents 
>>>> higher relevance?  My target is MarkLogic 5.x, but 6.x may be in 
>>>> play before long.
>>>> 
>>>> There are two schemes that I am aware of, neither of which feels 
>>>> very elegant:
>>>> 
>>>> 1) Give documents a high quality value when ingested.  Periodically 
>>>> crawl the content and for any document with positive quality, 
>>>> reduce its quality according to some algorithm until the quality reaches 
>>>> zero.
>>>> 
>>>> This gives the best control over "freshness", but has the 
>>>> disadvantage of causing potentially large numbers of updates on 
>>>> each pass with the attendant merges and disk I/O & CPU load.
>>>> 
>>>> 2) Replicate the "real" query n times, each and-ed with a 
>>>> time-based query against the insertion date.  All of these are 
>>>> or-ed together with descending weights for older dates.
>>>> 
>>>> This doesn't require changing documents to tweak their freshness.  
>>>> But it also means you have a stair-step function of n-steps, which 
>>>> may not be very precise - and which wouldn't scale very well for 
>>>> large values of n.  And unfortunately, since the queries would be 
>>>> time-based, you can't pre-register them ahead of time.
>>>> 
>>>> Any other clever techniques that you've used?
>>>> 
>>>> ---
>>>> Ron Hitchens {[email protected]}  +44 7879 358212
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>> 
>>> 
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://developer.marklogic.com/mailman/listinfo/general
>> 
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
