Re: Multi-relational data

Jake Mannix Wed, 05 May 2010 06:40:57 -0700

Hi Pedro,

  Representing data points as semi-structured entities (and then doing
clustering, classification, or content-based recommendation) is certainly
something on the horizon for Mahout. See
MAHOUT-274 <https://issues.apache.org/jira/browse/MAHOUT-274> for the start
of the thinking around representing semi-structured entities, and search
the list archives for "content-based recommendations" to see the
discussions about that.  This work is in its early stages, however, and
I would bet it doesn't get much support until the end of the summer
or into the autumn.
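
  To make the idea quoted below concrete (a relational "vector" plus a
distance measure over it), here is a minimal sketch. It is written in
Python for brevity and uses no Mahout API: the attribute-to-value-set
representation, the averaged Jaccard distance, and every name in it are
assumptions for illustration only, not how Mahout does or will do it.

```python
# Hypothetical sketch: each instance maps an attribute name to a set of
# values, and a value may be the id of another instance (e.g. "knows").

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def relational_distance(x, y):
    """Average Jaccard distance over every attribute present in x or y.

    A missing attribute is treated as an empty value set, which suits
    sparse data where most instances lack most attributes.
    """
    keys = x.keys() | y.keys()
    if not keys:
        return 0.0
    total = sum(
        jaccard_distance(x.get(k, frozenset()), y.get(k, frozenset()))
        for k in keys
    )
    return total / len(keys)

def nearest(query, instances):
    """Return the id of the instance closest to `query` (1-nearest neighbour)."""
    return min(instances, key=lambda name: relational_distance(query, instances[name]))

# Toy social network (made-up data).
people = {
    "ana": {"knows": {"bob", "carl"}, "city": {"Porto"}},
    "bob": {"knows": {"ana", "carl"}, "city": {"Porto"}},
    "dan": {"knows": {"eve"}},
}

if __name__ == "__main__":
    others = {name: p for name, p in people.items() if name != "ana"}
    print(nearest(people["ana"], others))
```

  Note that this only compares direct attribute values. Taking the
acquaintances of acquaintances into account would need a distance that
follows the references recursively, which the sketch does not attempt.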


  SVM support is on the way (see
MAHOUT-232 <https://issues.apache.org/jira/browse/MAHOUT-232>,
MAHOUT-237 <https://issues.apache.org/jira/browse/MAHOUT-237> and
MAHOUT-227 <https://issues.apache.org/jira/browse/MAHOUT-227>), and should be
fairly scalable by the end of the summer.

  -jake

On Wed, May 5, 2010 at 6:29 AM, Pedro Oliveira <cpdom...@gmail.com> wrote:

> Hi,
>
> On Wed, May 5, 2010 at 8:41 AM, Sean Owen <sro...@gmail.com> wrote:
>
> > You might have to be more specific. Support for this in the context of
> > what: recommendations, clustering, ...?
> >
>
> Classification, clustering, and recommendation are the most important ones.
>
>
> >
> > You can probably fit such concepts into any framework with enough
> > cleverness, so in that sense, as a general framework, sure I don't see
> > why any algorithm couldn't eventually be applied to such data.
> >
> > This is a fairly specific kind of data model, so I am not sure if it
> > would be something explicitly supported in some special way.
> >
>
> I'm currently working on a system that implements several non-parametric
> machine learning techniques for multi-relational data (K-Medoids,
> KNN, etc.), and it works quite nicely with data that fits in memory.
> However, I have some huge new datasets, so I'll probably need some kind
> of parallelization, and Mahout seems a good solution. The main purpose of
> my email was to see if there's someone else out there working on the same
> thing.
> From a quick look at the code, a straightforward solution would be to
> define a new type of Vector (it wouldn't be a vector in the mathematical
> sense, just a way to store relational information about an instance), and
> some DistanceMeasures to work with that vector. Then we could use
> distance-based techniques, such as canopy clustering and k-means.
> Are there any plans to implement more distance-based (or kernel-based)
> algorithms, such as SVMs and KNN?
>
> Cheers,
> Pedro
>
>
>
> >
> >
> > On Wed, May 5, 2010 at 1:26 PM, Pedro Oliveira <cpdom...@gmail.com> wrote:
> > > Hi,
> > >
> > > I have a simple question: does Mahout support, or plan to support,
> > > multi-relational datasets?
> > > I.e., datasets where each instance can have a variable number of values
> > > in an attribute, and values can be other instances?
> > > The basic example is a social network, where each person has several
> > > attributes, and some attributes, like "knows", can have several
> > > distinct values, and these values are other persons.
> > > These datasets are usually very sparse (there are lots of distinct
> > > attributes, but each instance only has values for a few of them), and
> > > the relational information is very relevant (in the social network
> > > example, the acquaintances of our acquaintances are relevant).
> > >
> > >
> > > Cheers,
> > > Pedro
> > >
> >
>
