Re: rowsimilarity

Ted Dunning Thu, 18 Sep 2014 11:17:15 -0700

LLR with text is commonly done (that is where it comes from).

The simple approach would be to have sentences be users and words be items.
This will result in word-word connections.
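
As a rough sketch of that setup (this is not Mahout's code; the whitespace
tokenizer, the function names and the four toy sentences are all made up for
illustration), each sentence is reduced to the set of words it contains, and
every co-occurring word pair gets a 2x2 contingency table that is scored with
LLR:

```python
from collections import defaultdict
from itertools import combinations
from math import log


def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative round-off


def word_word_llr(sentences):
    # Sentences play the role of users, words the role of items.
    # Only presence is recorded, not how often a word occurs in a sentence.
    rows = [set(s.lower().split()) for s in sentences]
    n = len(rows)
    seen = defaultdict(int)  # number of sentences containing each word
    both = defaultdict(int)  # number of sentences containing both words
    for words in rows:
        for w in words:
            seen[w] += 1
        for a, b in combinations(sorted(words), 2):
            both[(a, b)] += 1
    scores = {}
    for (a, b), k11 in both.items():
        k12 = seen[a] - k11               # a without b
        k21 = seen[b] - k11               # b without a
        k22 = n - seen[a] - seen[b] + k11  # neither
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


scores = word_word_llr([
    "the quick brown fox",
    "the lazy dog",
    "the quick fox",
    "the lazy cat",
])
print(scores[("fox", "quick")], scores[("fox", "the")])
```

In this toy input "fox" and "quick" always appear together and get a clearly
positive score, while "the" appears in every sentence and so scores zero
against everything.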


This doesn't directly give document-document similarities.  That could be
done by transposing the original data (word is user, document is item) but
I don't quite understand how to interpret that.  Another approach is simply
using term weighting and document normalization and scoring every doc
against every other.  That comes down to a matrix multiplication which is
very similar to the transposed LLR problem so that may give an
interpretation.
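
To make the second approach concrete, here is a minimal sketch assuming a
plain tf * log(N/df) weighting and L2 normalization (that weighting, the
function names and the three toy documents are my assumptions, nothing here
is taken from Mahout). With the normalized rows collected in a matrix A,
scoring every doc against every other is just A times its transpose:

```python
from collections import Counter
from math import log, sqrt


def tfidf_rows(docs):
    # One sparse row (dict of word -> weight) per document.
    counts = [Counter(d.lower().split()) for d in docs]
    n = len(counts)
    df = Counter(w for c in counts for w in c)  # document frequency per word
    rows = []
    for c in counts:
        row = {w: tf * log(n / df[w]) for w, tf in c.items()}
        norm = sqrt(sum(v * v for v in row.values()))
        # L2-normalize so that a dot product of two rows is their cosine.
        rows.append({w: v / norm for w, v in row.items()} if norm > 0 else {})
    return rows


def all_pairs_scores(rows):
    # Entry (i, j) is the dot product of row i and row j, i.e. (A A^T)[i][j].
    return [[sum(v * b.get(w, 0.0) for w, v in a.items()) for b in rows]
            for a in rows]


docs = [
    "the quick brown fox",
    "the quick brown dog",
    "the lazy cat",
]
scores = all_pairs_scores(tfidf_rows(docs))
for row in scores:
    print([round(x, 3) for x in row])
```

The first two toy documents share "quick" and "brown" and score above zero
against each other; the third shares only "the", which gets zero weight
because it occurs in every document.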


On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel <[email protected]> wrote:

> LLR with text or non-interaction data. What do we use for counts? Do we
> care how many times a token is seen in a doc, for instance, or do we just
> look to see if it was seen? I assume the latter, which means we need a new
> numNonZeroElementsPerRow in several places in math-scala, right?
>
> All the same questions are going to come up over this as did for
> numNonZeroElementsPerColumn so please speak now or I’ll just mirror its
> implementation.
>
>
> On Aug 25, 2014, at 9:38 AM, Pat Ferrel <[email protected]> wrote:
>
> Turning itemsimilarity into rowsimilarity is fairly simple but requires
> altering CooccurrenceAnalysis.cooccurrence to swap the transposes and
> calculate the LLR values for rows rather than columns. The input will be
> something like a DRM. Row similarity does something like AA’ with LLR
> weighting and uses similar downsampling as I take it from the Hadoop code.
> Let me know if I’m on the wrong track here.
>
> With the new application-ID-preserving code the following input could be
> directly processed (it’s my test case):
>
> doc1\tNow is the time for all good people to come to aid of their party
> doc2\tNow is the time for all good people to come to aid of their country
> doc3\tNow is the time for all good people to come to aid of their hood
> doc4\tNow is the time for all good people to come to aid of their friends
> doc5\tNow is the time for all good people to come to aid of their looser brother
> doc6\tThe quick brown fox jumped over the lazy dog
> doc7\tThe quick brown fox jumped over the lazy boy
> doc8\tThe quick brown fox jumped over the lazy cat
> doc9\tThe quick brown fox jumped over the lazy wolverine
> doc10\tThe quick brown fox jumped over the lazy cantelope
>
> The output will be something like the following, with or without LLR
> strengths.
> doc1\tdoc2 doc3 doc4 doc5
> …
> doc6\tdoc7 doc8 doc9 doc10
> ...
>
> It would be pretty easy to tack on a text analyzer from Lucene to turn
> this into a full-function doc similarity job since LLR doesn’t need TF-IDF.
>
> One question is: is there any reason to do the cross-similarity in RSJ, so
> [AB’]? I can’t picture what this would mean so am assuming the answer is no.
>
>
>
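
For what it's worth, the quoted test case can be tried in a few lines of
plain Python. This is only a sketch of the idea (rows are docs, tokens are
reduced to presence, each pair of rows is scored with LLR); it is not the
CooccurrenceAnalysis code and it does no downsampling. The tokenizer, the
function names and the top_n parameter are made up for illustration. The
size of each row's token set is the stand-in for the number of non-zero
elements in that row.

```python
from math import log


def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def row_similarity(lines, top_n=4):
    # Parse "id<TAB>text" lines; keep only which tokens occur in each row.
    docs = {}
    for line in lines:
        doc_id, text = line.split("\t", 1)
        docs[doc_id] = set(text.lower().split())
    vocab_size = len(set().union(*docs.values()))
    result = {}
    for a, words_a in docs.items():
        scored = []
        for b, words_b in docs.items():
            if a == b:
                continue
            k11 = len(words_a & words_b)          # tokens in both rows
            k12 = len(words_a) - k11              # tokens only in row a
            k21 = len(words_b) - k11              # tokens only in row b
            k22 = vocab_size - k11 - k12 - k21    # tokens in neither row
            scored.append((llr(k11, k12, k21, k22), b))
        scored.sort(key=lambda t: -t[0])          # strongest first
        result[a] = [b for _, b in scored[:top_n]]
    return result


DOCS = [
    "doc1\tNow is the time for all good people to come to aid of their party",
    "doc2\tNow is the time for all good people to come to aid of their country",
    "doc3\tNow is the time for all good people to come to aid of their hood",
    "doc4\tNow is the time for all good people to come to aid of their friends",
    "doc5\tNow is the time for all good people to come to aid of their looser brother",
    "doc6\tThe quick brown fox jumped over the lazy dog",
    "doc7\tThe quick brown fox jumped over the lazy boy",
    "doc8\tThe quick brown fox jumped over the lazy cat",
    "doc9\tThe quick brown fox jumped over the lazy wolverine",
    "doc10\tThe quick brown fox jumped over the lazy cantelope",
]

for doc_id, similar in row_similarity(DOCS).items():
    print(doc_id + "\t" + " ".join(similar))
```

With top_n=4 the nearest rows for doc1 are the other four "Now is the time"
docs, and for doc6 the other four "quick brown fox" docs. One caveat: LLR
only measures how surprising the overlap is, so pairs from different groups
(which share just "the") still get a small non-zero score and would show up
if top_n were larger.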
