Hi Robin,

Thanks for the quick response. I think I got the point. But there seems to be
a lot of I/O going on: writing the objects into files, then reading the files
back into memory. I know the current implementations focus on files as the
storage. It would be nice to have a unified model class (whether in a binary
format or not) across the algorithms (classification, clustering, and CF),
with various drivers that transfer data from files, XML, memory, or relational
and non-relational databases. That would make the framework more flexible.

I understand the project is still at an early stage and has other priorities.
But I think the dataset is quite fundamental to the framework.

Once again, thanks for your informative response. By the way, I am reading the
Manning book you and Sean Owen are working on, and I am looking forward to the
future chapters.

Yuan

On Tue, Jan 26, 2010 at 10:33 PM, Robin Anil <[email protected]> wrote:

> Hi Yuan, the Bayes classifier takes only binary features. So in order to
> turn your User class into a dataset, you need to create a tab-separated file
> with the label as the key and space-separated features as the value. The
> presence of a feature makes it true; its absence makes it false.
>
> E.g., if you are classifying heart-attack-prone vs. healthy individuals
> (assuming from your data), take two labels: heart-attack and healthy.
>
> You will need to map integer and double values to boolean features. Say you
> have boolean features like
>
> Weight:40-50
> Weight:50-60
>
> Age:20-30
> Age:30-40
>
> For user A with age = 23, weight = 53, diabetes = false,
> write the line
>
> healthy<TAB>Age:20-30 Weight:50-60
>
> For user B with age = 37, weight = 52, diabetes = true, write the line
>
> heart-attack<TAB>Age:30-40 Weight:50-60 diabetes
>
> You will have many such lines in your dataset file, one per user. Give the
> file path to the classifier and it learns the model for you.
>
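
As a minimal sketch of the conversion described above, the following plain
Java turns in-memory objects into training lines of the form
label<TAB>features. It uses no Mahout classes. The names User,
BayesTrainingLines, bucket, toLine, and write are hypothetical, and the
bucket width of 10 is an assumption taken from the Age:20-30 and Weight:50-60
examples. The label is not a field of User, so the caller has to supply it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory record; fields follow the User class from the thread.
final class User {
    final long id;
    final int age;
    final double weight;
    final boolean diabetes;

    User(long id, int age, double weight, boolean diabetes) {
        this.id = id;
        this.age = age;
        this.weight = weight;
        this.diabetes = diabetes;
    }
}

// Hypothetical helper, not part of Mahout.
final class BayesTrainingLines {
    private BayesTrainingLines() {}

    // Maps a numeric value to a range token such as "Age:20-30".
    // Assumes buckets of width 10 with an inclusive lower bound.
    static String bucket(String name, double value) {
        int lower = (int) Math.floor(value / 10.0) * 10;
        return name + ":" + lower + "-" + (lower + 10);
    }

    // Builds one line: label, a tab, then the space-separated features
    // that are present. An absent boolean feature is simply left out.
    static String toLine(String label, User user) {
        List<String> features = new ArrayList<>();
        features.add(bucket("Age", user.age));
        features.add(bucket("Weight", user.weight));
        if (user.diabetes) {
            features.add("diabetes");
        }
        return label + "\t" + String.join(" ", features);
    }

    // Writes one line per user; labels.get(i) is the label of users.get(i).
    static void write(Path file, List<User> users, List<String> labels)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < users.size(); i++) {
            lines.add(toLine(labels.get(i), users.get(i)));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

With user A (age 23, weight 53, no diabetes) and the label healthy, toLine
produces the same line as in the example above. The resulting file path would
then be passed to the classifier in the usual way.
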
> For now, the algorithm takes its data from a file, not from an in-memory
> data structure, and does not use vectors. Try the classification
> example (20newsgroups) to get an idea of how the classifier can be run.
>
> Robin
>
> On Wed, Jan 27, 2010 at 8:56 AM, Yuan Wang <[email protected]> wrote:
>
> > Hi all,
> >
> > I am learning Mahout. It seems to me that most of the examples load
> > datasets from files using the command line. I know the Bayes classifier
> > can work with HBase.
> >
> > Is there any way to build the dataset from scratch in Java code?
> >
> > For example, there is a User class with four attributes: ID (long or
> > String), age (int), weight (double), and diabetes (boolean).
> > There are 100 User objects in memory. Is there a way I can convert them
> > into any type of dataset that the classifier algorithm can handle?
> >
> > I noticed there are a vector class and InMemoryDataStore, but I don't
> > know how to use them. If someone can give a hint or write down some
> > pseudocode, that would be very helpful.
> >
> > Thanks,
> > Yuan
> >
>
