The simplest way to explain this is to say that it describes the data headers. Here is a simple example:

Say column 1 is a numeric value and column 2 is a class value. Some
algorithms might only accept discrete values, but I know for a fact that
the numeric value is an integer between 1 and 10 and could thus be
treated as a discrete value even though it is not declared as one. I
don't want to go messing about in the headers of the physical data set,
nor do I want to transform the complete data set. Instead, I remap the
first column and state that it is a discrete value in my logical data set.

The data definition model does not care whether the attribute values are integers, strings or whatnot. They are all objects, and they can be transformed to the type the attribute was mapped as. The separation of layers makes this very simple.
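
A minimal sketch of that remapping, in Java. All names here (LogicalMappingSketch, LogicalAttribute, AttributeType, read) are made up for illustration; they are not taken from the JSRs or from pseudo_jsr.txt. The physical record is left untouched, and the logical attribute only decides how a column is read:

```java
public class LogicalMappingSketch {

    /** How the logical data set declares an attribute (hypothetical enum). */
    enum AttributeType { NUMERIC, DISCRETE }

    /** Hypothetical logical attribute: a name, a declared type and the physical column it maps to. */
    static final class LogicalAttribute {
        final String name;
        final AttributeType type;
        final int physicalColumn;

        LogicalAttribute(String name, AttributeType type, int physicalColumn) {
            this.name = name;
            this.type = type;
            this.physicalColumn = physicalColumn;
        }

        /** Reads one value from a physical record and converts it to the logical type. */
        Object read(Object[] physicalRecord) {
            Object raw = physicalRecord[physicalColumn];
            switch (type) {
                case DISCRETE:
                    // Whole numbers become plain category labels, e.g. 7 or 7.0 becomes "7".
                    if (raw instanceof Number) {
                        double d = ((Number) raw).doubleValue();
                        if (d == Math.rint(d)) {
                            return String.valueOf((long) d);
                        }
                    }
                    return String.valueOf(raw);
                case NUMERIC:
                    // Numbers pass through as doubles; anything else is parsed from its string form.
                    return raw instanceof Number
                            ? (Object) ((Number) raw).doubleValue()
                            : (Object) Double.parseDouble(raw.toString());
                default:
                    throw new IllegalStateException("Unknown type: " + type);
            }
        }
    }

    public static void main(String[] args) {
        // Physical records: column 0 is numeric (1..10), column 1 is the class value.
        Object[][] physical = { { 7, "yes" }, { 3, "no" } };

        // Logical data set: column 0 is remapped as discrete, the physical data is not changed.
        LogicalAttribute score = new LogicalAttribute("score", AttributeType.DISCRETE, 0);
        LogicalAttribute label = new LogicalAttribute("label", AttributeType.DISCRETE, 1);

        for (Object[] record : physical) {
            System.out.println(score.read(record) + " -> " + label.read(record));
        }
    }
}
```

An algorithm that only accepts discrete values would be handed the logical attributes and never see that the first column is stored as a number.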

But most importantly, I really want to see a unified data model definition (typed instance headers) and a very simple abstract way to access physical data records (the instances) that we can share between data transformation suites, ML algorithms, feature selectors and whatnot.

Do you read UML? If so, the JSRs have some great documentation.

There is more to it than what I tried to explain here, and I probably didn't pick up half of the ideas behind the JSR data models.


    karl

Grant Ingersoll wrote:
I haven't looked at the JSRs. Can you explain the use cases a bit more? How would it be used in M/R, and in implementations? I like the sound of it.


On Feb 25, 2008, at 4:34 PM, Karl Wettin (JIRA) wrote:


[ https://issues.apache.org/jira/browse/MAHOUT-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-8:
-----------------------------

   Attachment: pseudo_jsr.txt

My question is, did anyone else take a closer look at the JSRs? I would very much like to hear what you people think of this data model. I'm quite attracted to it.

It says nothing about how data is stored; it is about roles and abstract access to physical instance data. It separates the logical model (the data set definition used by ML algorithms) from the physical model (the data set definition describing the source data), allowing one to virtually transform the data set by mapping logical data to the physical data in any way without messing things up.

I now have a half-baked pseudo implementation of this. It uses abstract classes rather than interfaces, and some of the interfaces have been merged into a single class. It would, however, not be a big deal to have it implement the interfaces if one wishes. I feel some of the stuff in there is a bit overkill at this point, but I tried to follow the specs as well as I could (I replaced a few ad hoc enum classes with enums, etc).

There is no documentation, tests or anything concrete, just a bunch of classes I'm now popping in the JIRA to show what it could look like.

Actually, there is an early attempt at an abstract seekable physical data record reader. And an ARFF writer. They are sort of my dry coded thoughts. You can ignore them.


Data definition model
---------------------

               Key: MAHOUT-8
               URL: https://issues.apache.org/jira/browse/MAHOUT-8
           Project: Mahout
        Issue Type: New Feature
          Reporter: Karl Wettin
       Attachments: pseudo_jsr.txt


How do we define classes, attributes and instance data?
This has nothing to do with physical data records, this is about data types, roles, etc.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






