[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977192#comment-13977192
]
Pat Ferrel commented on MAHOUT-1518:
------------------------------------
[~dlyubimov] The dictionaries are Scala Maps:
He is associating Mahout int IDs with the external string IDs of the columns and
rows. As I said above, ultimately we need something like Guava's BiMap (e.g.
HashBiMap) — maybe even use that class? This is so the export can get the string
back from the int. I'm not familiar enough with the Scala Map to say whether it
supports reverse lookup when the values are unique too.
def asOrderedDictionary(entries: Array[String]): Map[String, Int] = {
  var dictionary = Map[String, Int]()
  var index = 0
  for (entry <- entries) {
    dictionary += entry -> index
    index += 1
  }
  dictionary
}
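For what it's worth, a plain immutable Scala Map can stand in for a BiMap as
long as the values are unique: inverting it with .map(_.swap) gives the
int-to-string direction the export needs. A small self-contained sketch (the
object name and sample IDs here are just for illustration):

```scala
// Sketch: build an ordered dictionary, then invert it for export.
// Inverting with .map(_.swap) is safe because the values (indices)
// are unique, so no entries collide.
object DictionaryExample {
  def asOrderedDictionary(entries: Array[String]): Map[String, Int] =
    entries.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val dictionary = asOrderedDictionary(Array("u1", "u2", "u3"))
    // Reverse lookup: Mahout int ID -> external string ID.
    val reverse: Map[Int, String] = dictionary.map(_.swap)
    assert(dictionary("u2") == 1)
    assert(reverse(1) == "u2")
  }
}
```

zipWithIndex.toMap is also a more idiomatic way to write the loop above, though
the behavior is the same.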
This is a very simple object, just a DRM in the Scala Mahout DSL sense, and two
Scala Maps:
IndexedDataset(drmInteractions, userIDDictionary, itemIDDictionary)
For the import/export, a row iterator and two BiMaps are all that is needed. The
row iterator can use the raw DRM methods, or we can add some convenience
methods. Maybe it's best to add functionality as it is required? I'm not sure
what a generalized, full-blown data-frame-ish thing would look like. Are general
slice ops needed, ones you can't get from the underlying DRM?
Since this simple object contains a true RDD-backed DRM, any code that accepts
it can grab the DRM and use it directly, then do whatever makes sense with the
dictionaries. Transpose would swap them, for instance. Cooccurrence would take
the column dictionary or dictionaries and create a new
IndexedDataset(drmInteractions, columnIDDictionaryB, columnIDDictionaryA), or
IndexedDataset(drmInteractions, columnIDDictionaryA, columnIDDictionaryA) for
self-similarity.
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)