[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977192#comment-13977192
 ] 

Pat Ferrel commented on MAHOUT-1518:
------------------------------------

[~dlyubimov]   The dictionaries are Scala Maps:

He is associating Mahout int IDs with the external string IDs for the rows and 
columns. As I said above, ultimately we need something like Guava's HashBiMap 
(the Java BiMap implementation), maybe even use that class directly? This is so 
the export code can get the external string ID back from the Mahout int. I'm 
not familiar enough with the Scala Map to know whether it already supports that 
kind of reverse lookup when the values are unique.

  // Builds a dictionary mapping each external string ID to its ordinal
  // position, which becomes the Mahout int ID.
  def asOrderedDictionary(entries: Array[String]): Map[String, Int] = {
    var dictionary = Map[String, Int]()
    var index = 0
    for (entry <- entries) {
      dictionary += entry -> index
      index += 1
    }
    dictionary
  }
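
If we do end up going the Guava route, here is a rough sketch (assuming Guava 
is on the classpath; the method name asOrderedBiMap is mine, not existing code) 
of the same ordered dictionary built as a HashBiMap, so the export side can 
look the string up from the int via inverse():

  import com.google.common.collect.HashBiMap

  // Same ordered dictionary as above, but as a Guava BiMap so the export
  // side can go from a Mahout int ID back to the external string ID.
  def asOrderedBiMap(entries: Array[String]): HashBiMap[String, Integer] = {
    val dictionary = HashBiMap.create[String, Integer]()
    for ((entry, index) <- entries.zipWithIndex)
      dictionary.put(entry, index)   // Int auto-boxes to java.lang.Integer
    dictionary
  }

  // dictionary.get(externalID)    -> Mahout int ID
  // dictionary.inverse().get(0)   -> external string ID for index 0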

This is a very simple object: just a DRM in the Mahout Scala DSL sense plus two 
Scala Maps:

    IndexedDataset(drmInteractions, userIDDictionary, itemIDDictionary)
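
Roughly, the wrapper amounts to something like this (a minimal sketch, not 
committed code; I'm assuming the DSL's DrmLike[Int] trait for the wrapped 
matrix, and the import path varies by Mahout version):

    import org.apache.mahout.math.drm.DrmLike

    // A DRM plus its row and column dictionaries
    // (external string ID -> Mahout int ID).
    case class IndexedDataset(
        matrix: DrmLike[Int],
        rowIDs: Map[String, Int],      // e.g. userIDDictionary
        columnIDs: Map[String, Int])   // e.g. itemIDDictionary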

For import/export, a row iterator and two BiMaps are all that is needed. The 
row iterator can use the raw DRM methods, or we can add some convenience methods.
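
As a sketch of the export side (hypothetical names, reusing the HashBiMap shape 
from above), the two BiMaps are only needed to translate indices back to 
strings for each cell the row iterator produces:

  import com.google.common.collect.HashBiMap

  // Turn one (rowIndex, columnIndex, value) cell into a CSV line with the
  // external string IDs instead of the Mahout ints.
  def toCSVLine(rowIndex: Int, columnIndex: Int, value: Double,
                rowIDs: HashBiMap[String, Integer],
                columnIDs: HashBiMap[String, Integer]): String =
    s"${rowIDs.inverse().get(rowIndex)},${columnIDs.inverse().get(columnIndex)},$value"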

Maybe it's best to add functionality as it is required? I'm not sure what a 
generalized, full-blown data-frame-ish thing would even include. Are general 
slice ops needed, ones that you can't get from the underlying DRM?

Since this simple object contains a true RDD-backed DRM, any code that accepts 
it can grab the DRM and use it directly, then do whatever makes sense with the 
dictionaries. Transpose would swap them, for instance. Cooccurrence would take 
the column dictionary (or dictionaries) and create a new 
IndexedDistributedMatrix(drmInteractions, columnIDDictionaryB, 
columnIDDictionaryA), or IndexedDistributedMatrix(drmInteractions, 
columnIDDictionaryA, columnIDDictionaryA) for self-similarity.
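
To make the dictionary handling concrete, here is a rough sketch using the 
assumed IndexedDataset shape from above (the helper names are hypothetical; 
`t` is the DSL's transpose, and the import paths again vary by version):

    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Transpose: rows become columns, so the two dictionaries simply swap.
    def transposed(ids: IndexedDataset): IndexedDataset =
      IndexedDataset(ids.matrix.t, rowIDs = ids.columnIDs, columnIDs = ids.rowIDs)

    // Self-similarity: both dimensions are items, so the item (column)
    // dictionary is reused for rows and columns alike.
    def selfSimilarity(drmSimilarity: DrmLike[Int],
                       itemIDs: Map[String, Int]): IndexedDataset =
      IndexedDataset(drmSimilarity, rowIDs = itemIDs, columnIDs = itemIDs)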

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
