BTW, the co-occurrence code is going into RSJ too, and there are uses of that where cosine is expected. I don't know how to think about cross-cosine. Is there an argument for LLR-only in RSJ?
On Aug 6, 2014, at 5:20 PM, Sebastian Schelter <ssc.o...@googlemail.com> wrote:

Sounds good to me.

-s

On 06.08.2014 17:15, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

> What I mean here: I probably need to refactor it a little so that there is
> a part of the algorithm that accepts co-occurrence input directly and is
> somewhat decoupled from the part that accepts user x item input and does
> the downsampling and co-occurrence construction. That way I could do some
> customization of my own to the co-occurrence construction. Would that be
> reasonable?
>
> On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
>> Asking because I am considering pulling in this implementation, but for
>> some (mostly political) reasons people here want to try different things.
>>
>> I may also have to start with a different way of constructing
>> co-occurrences, and may do a few optimizations there (e.g. the priority
>> queue enqueuing/dequeuing does twice the work it really needs to, etc.).
>>
>> On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
>>
>>> I chose not to port all the similarity measures to the DSL version of
>>> the co-occurrence analysis for two reasons. First, adding the measures
>>> in a generalizable way makes the code very hard to read. Second, in
>>> practice I have never seen anything give better results than LLR. As Ted
>>> pointed out, much of the foundation for using similarity measures comes
>>> from wanting to predict ratings, which people never do in practice. I
>>> think we should restrict ourselves to approaches that work with
>>> implicit, count-like data.
>>>
>>> -s
>>>
>>> On 06.08.2014 16:58, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>>>
>>>> On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>
>>>>> On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>
>>>>>> I suppose in that context LLR is considered a distance (higher
>>>>>> scores mean more `distant` items, co-occurring by chance only)?
>>>>>
>>>>> Self-correction on this one -- having taken a quick look at the LLR
>>>>> paper again, it looks like it is actually a similarity (higher scores
>>>>> mean more stable co-occurrences, i.e. it moves in the opposite
>>>>> direction the p-value would, had it been a classic test).
>>>>
>>>> LLR is a classic test. It is essentially Pearson's chi^2 test without
>>>> the normal approximation. See my papers [1][2] introducing the test
>>>> into computational linguistics (which ultimately brought it into all
>>>> kinds of fields, including recommendations), and also the references
>>>> for the G^2 test [3].
>>>>
>>>> [1] http://www.aclweb.org/anthology/J93-1003
>>>> [2] http://arxiv.org/abs/1207.1847
>>>> [3] http://en.wikipedia.org/wiki/G-test
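
[Editorial note: to make Ted's point concrete, here is a minimal Scala sketch of the G^2 statistic he references, in the entropy formulation from his 1993 paper. It loosely follows the shape of Mahout's LogLikelihood utility, but the object and method names here are illustrative, not the committed code.]

    object LLRSketch {

      // x * ln(x), with the 0 * ln(0) = 0 convention.
      private def xLogX(x: Long): Double =
        if (x == 0L) 0.0 else x * math.log(x.toDouble)

      // Unnormalized Shannon entropy of a list of counts.
      private def entropy(counts: Long*): Double =
        xLogX(counts.sum) - counts.map(xLogX).sum

      // G^2 for a 2x2 contingency table of co-occurrence counts:
      //   k11 = A and B together, k12 = A without B,
      //   k21 = B without A,      k22 = neither.
      // Higher scores mean the co-occurrence is harder to explain by
      // chance, so it behaves as a similarity, per the self-correction
      // above.
      def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
        val rowEntropy    = entropy(k11 + k12, k21 + k22)
        val columnEntropy = entropy(k11 + k21, k12 + k22)
        val matrixEntropy = entropy(k11, k12, k21, k22)
        // Clamp tiny negative values caused by floating-point rounding.
        math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
      }
    }

A call like LLRSketch.logLikelihoodRatio(k11, k12, k21, k22) returns a larger score the more often two items co-occur relative to what their marginal counts would predict; independent items score near zero.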
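
[Editorial note: on Dmitriy's refactoring question further up the thread, one hedged way to picture the split is as two decoupled stages. The trait, type parameter, and method names below are all hypothetical illustrations, not the actual Mahout DSL API.]

    // Hypothetical sketch of the proposed decoupling: co-occurrence
    // construction separated from scoring, so callers can feed a prebuilt
    // co-occurrence matrix straight into stage 2.
    trait CooccurrenceAnalysis[M] {

      // Stage 1: downsample user x item interactions and build raw
      // co-occurrence counts. Custom construction plugs in here.
      def buildCooccurrences(userItem: M): M

      // Stage 2: score a co-occurrence matrix (e.g. with LLR), accepting
      // co-occurrence input directly, independent of stage 1.
      def scoreCooccurrences(cooccurrences: M): M

      // The full pipeline is just the composition of the two stages.
      def run(userItem: M): M = scoreCooccurrences(buildCooccurrences(userItem))
    }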