Hi Pedro,

Representing data points as semi-structured entities (and then doing clustering, classification, or content-based recommendation) is certainly on the horizon for Mahout. See MAHOUT-274 <https://issues.apache.org/jira/browse/MAHOUT-274> for the start of the thinking around representing semi-structured entities, and search the list archives for "content-based recommendations" to see the discussions about that. This work is in its early stages, however, and I would bet it doesn't get much support until the end of the summer or into the autumn.

SVM support is on the way: see MAHOUT-232 <https://issues.apache.org/jira/browse/MAHOUT-232>, MAHOUT-237 <https://issues.apache.org/jira/browse/MAHOUT-237>, and MAHOUT-227 <https://issues.apache.org/jira/browse/MAHOUT-227>. It should be fairly scalable by the end of the summer.

-jake

On Wed, May 5, 2010 at 6:29 AM, Pedro Oliveira <cpdom...@gmail.com> wrote:

> Hi,
>
> On Wed, May 5, 2010 at 8:41 AM, Sean Owen <sro...@gmail.com> wrote:
>
> > You might have to be more specific. Support for this in the context
> > of what: recommendations, clustering?
>
> Classification, clustering, and recommendation are the most important
> ones.
>
> > You can probably fit such concepts into any framework with enough
> > cleverness, so in that sense, as a general framework, sure, I don't
> > see why any algorithm couldn't eventually be applied to such data.
> >
> > This is a fairly specific kind of data model, so I am not sure if it
> > would be something explicitly supported in some special way.
>
> I'm currently working on a system that implements several
> non-parametric machine learning techniques to work with
> multi-relational data (K-Medoids, KNN, etc.), and it works quite nicely
> with data that fits in memory. However, I have some new huge datasets,
> and I'll probably need to use some kind of parallelization, and Mahout
> seems a good solution. The main purpose of my email was to see if
> there's someone else out there working on the same thing as I am.
>
> From a quick look at the code, a straightforward solution would be to
> define a new type of Vector (it wouldn't be a vector in the
> mathematical sense, just a way to save relational information about an
> instance), and some DistanceMeasures to work with that vector. Then we
> could use distance-based techniques, such as canopy clustering and
> k-means.
>
> Are there any plans to implement more distance-based (or kernel-based)
> algorithms, such as SVMs and KNN?
>
> Cheers,
> Pedro
>
> > On Wed, May 5, 2010 at 1:26 PM, Pedro Oliveira <cpdom...@gmail.com>
> > wrote:
> > > Hi,
> > >
> > > I have a simple question: does Mahout support, or plan to support,
> > > multi-relational datasets? That is, datasets where each instance
> > > can have a variable number of values in an attribute, and values
> > > can be other instances?
> > >
> > > The basic example is a social network, where each person has
> > > several attributes, and some attributes, like "knows", can have
> > > several distinct values, and these values are other persons.
> > >
> > > These datasets are usually very sparse (there are lots of distinct
> > > attributes, but each instance only has values for a few of them),
> > > and the relational information is very relevant (in the social
> > > network example, the acquaintances of our acquaintances are
> > > relevant).
> > >
> > > Cheers,
> > > Pedro
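
For illustration, here is a minimal sketch of the distance-based idea discussed in the thread: an instance that stores a set of values per attribute, plus a distance over such instances. It is plain Java and deliberately does not use Mahout's Vector or DistanceMeasure interfaces. The Entity class, the choice of a mean Jaccard distance, and every name below are assumptions made for this example, not anything proposed in the JIRA issues. It also compares only direct attribute values, so it does not capture second-order relations such as acquaintances of acquaintances.

```java
import java.util.*;

public class RelationalDistanceSketch {

    /** Hypothetical instance: attribute name -> set of values (literals or ids of other instances). */
    static final class Entity {
        final String id;
        final Map<String, Set<String>> attributes = new HashMap<>();

        Entity(String id) {
            this.id = id;
        }

        /** Adds one or more values to a (possibly multi-valued) attribute. */
        Entity with(String attribute, String... values) {
            attributes.computeIfAbsent(attribute, k -> new HashSet<>())
                      .addAll(Arrays.asList(values));
            return this;
        }
    }

    /** Jaccard distance between two value sets: 1 - |intersection| / |union|. */
    static double jaccardDistance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / unionSize;
    }

    /** Mean Jaccard distance over every attribute present in at least one of the two entities. */
    static double distance(Entity x, Entity y) {
        Set<String> names = new HashSet<>(x.attributes.keySet());
        names.addAll(y.attributes.keySet());
        if (names.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (String name : names) {
            total += jaccardDistance(
                x.attributes.getOrDefault(name, Collections.<String>emptySet()),
                y.attributes.getOrDefault(name, Collections.<String>emptySet()));
        }
        return total / names.size();
    }

    /** 1-nearest-neighbour lookup by linear scan; returns null if candidates is empty. */
    static Entity nearest(Entity query, List<Entity> candidates) {
        Entity best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Entity candidate : candidates) {
            double d = distance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Entity ana = new Entity("ana").with("knows", "bob", "carla").with("city", "Porto");
        Entity bob = new Entity("bob").with("knows", "ana", "carla").with("city", "Porto");
        Entity dan = new Entity("dan").with("knows", "eve").with("city", "Lisbon");

        System.out.println("ana-bob: " + distance(ana, bob));
        System.out.println("ana-dan: " + distance(ana, dan));
        System.out.println("nearest to ana: " + nearest(ana, Arrays.asList(dan, bob)).id);
    }
}
```

Because attributes missing from an instance are treated as empty sets, sparse data needs no special handling here. Any distance-based method that only needs pairwise distances (k-medoids, KNN) could be built on top of a function like this, whereas k-means would additionally need a notion of centroid for such instances.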