Is it really that complex? Either way, I say we need something very low
level that describes how data should be treated, plus simple random
access from any source. Matrices are something several layers above that.
Ted Dunning wrote:
The thing that brings me up short when reading things like this JSR is that
they have a LOT of mechanism here to explain something that is pretty simple
in a language like R with the data.frame object.
I am left with the question of what is going on with the complexity. Some
explanations that I could imagine include:
A) the complexity is optional and R has a simpler solution
B) the language of discourse is somehow evil and R is just as complex, but
it is somehow vastly easier to explain an R data.frame than it is to explain
what the JSR is talking about.
C) Java itself is somehow at fault and it is forcing complexity on the
problem that isn't necessary
D) I am clueless and R lacks the complexity, the JSR has it but it is all
necessary.
My gut says that (a) is the right answer. My ego causes me to discount (d).
My religion causes me to discount (mostly) (c). I would find it hard to
argue why (b) is not true.
Anybody else have an opinion?
On 2/28/08 11:40 AM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
The simplest way to explain this is to say it is the data headers. Here
is a simple example:
Say column 1 is a numeric value and column 2 is a class value. Some
algorithms might only accept discrete values, but I know for a fact that
the numeric value is an integer between 1 and 10 and could thus be
treated as a discrete value even though it is not. I don't want to go
messing about in the headers of the physical data set, nor do I want to
transform the complete data set, instead I remap the first column and
state it is a discrete value in my logical data set.
The data definition model does not care if the attribute values are
integers, strings or whatnot. They are all objects and they can be
transformed to the type they were mapped as. And the separation of layers
makes it very simple.
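The remapping described above could be sketched roughly as follows. This is only an illustration under my own assumptions, not the JSR API and not the attached pseudo implementation: the names LogicalMappingSketch, PhysicalAttribute, LogicalAttribute and toDiscreteLabel are all hypothetical.

```java
import java.util.function.Function;

public class LogicalMappingSketch {

    /** How an algorithm is asked to treat an attribute. */
    enum AttributeType { NUMERIC, DISCRETE }

    /** One header entry of the physical (source) data set; never modified. */
    static final class PhysicalAttribute {
        final String name;
        final AttributeType type;

        PhysicalAttribute(String name, AttributeType type) {
            this.name = name;
            this.type = type;
        }
    }

    /** A logical attribute: a typed view onto one physical column. */
    static final class LogicalAttribute {
        final PhysicalAttribute source;
        final int physicalIndex;
        final AttributeType type;
        final Function<Object, Object> transform;

        LogicalAttribute(PhysicalAttribute source, int physicalIndex,
                         AttributeType type, Function<Object, Object> transform) {
            this.source = source;
            this.physicalIndex = physicalIndex;
            this.type = type;
            this.transform = transform;
        }

        /** Reads the mapped value from a physical record without copying the data set. */
        Object valueOf(Object[] physicalRecord) {
            return transform.apply(physicalRecord[physicalIndex]);
        }
    }

    /** Hypothetical mapping: an integer-valued number in 1..10 becomes a discrete label. */
    static Object toDiscreteLabel(Object raw) {
        double number = ((Number) raw).doubleValue();
        int value = (int) number;
        if (value != number || value < 1 || value > 10) {
            throw new IllegalArgumentException("not an integer in 1..10: " + raw);
        }
        return Integer.toString(value);
    }

    public static void main(String[] args) {
        PhysicalAttribute physical = new PhysicalAttribute("score", AttributeType.NUMERIC);
        LogicalAttribute logical = new LogicalAttribute(
                physical, 0, AttributeType.DISCRETE, LogicalMappingSketch::toDiscreteLabel);

        Object[] record = { 7.0, "spam" };
        System.out.println(logical.type + " " + logical.valueOf(record)); // DISCRETE 7
        System.out.println(physical.type); // NUMERIC, the physical header is unchanged
    }
}
```

The point of the sketch is that only the logical attribute carries the DISCRETE type; the physical header and the stored values stay as they are.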
But most important, I really want to see a unified data model definition
(typed instance headers) and a very simple abstract way to access
physical data records (the instances) that we can share between data
transformation suites, ML algorithms, feature selectors and whatnot.
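As a rough idea of what "a very simple abstract way to access physical data records" might look like, here is a minimal sketch. It assumes an abstract class with random access by index; PhysicalRecordSource, InMemorySource and meanOfColumn are hypothetical names made up for this example, not anything taken from the JSRs.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordAccessSketch {

    /** Random access to physical records, whatever the underlying storage is. */
    abstract static class PhysicalRecordSource {
        abstract int size();
        abstract Object[] read(int index);
    }

    /** Simplest possible source: records held in memory. */
    static final class InMemorySource extends PhysicalRecordSource {
        private final List<Object[]> records;

        InMemorySource(List<Object[]> records) {
            this.records = new ArrayList<>(records);
        }

        @Override
        int size() {
            return records.size();
        }

        @Override
        Object[] read(int index) {
            // Return a copy so callers cannot modify the stored record.
            return records.get(index).clone();
        }
    }

    /** A consumer that depends only on the abstraction, not on the storage. */
    static double meanOfColumn(PhysicalRecordSource source, int column) {
        double sum = 0;
        for (int i = 0; i < source.size(); i++) {
            sum += ((Number) source.read(i)[column]).doubleValue();
        }
        return sum / source.size();
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { 2.0, "a" });
        rows.add(new Object[] { 4.0, "b" });
        System.out.println(meanOfColumn(new InMemorySource(rows), 0)); // 3.0
    }
}
```

A file-backed or distributed source would extend the same abstract class, and code such as meanOfColumn would not need to change.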
Do you read UML? The JSRs have some great documentation then.
There is more to it than what I tried to explain here. And I probably
didn't pick up half of the ideas behind the JSR data models.
karl
Grant Ingersoll wrote:
I haven't looked at JSRs. Can you explain the use cases a bit more?
How would it be used in M/R, and in implementations? I like the sound
of it.
On Feb 25, 2008, at 4:34 PM, Karl Wettin (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated MAHOUT-8:
-----------------------------
Attachment: pseudo_jsr.txt
My question is, did anyone else take a closer look at the JSRs? I
would very much like to hear what you people think of this data model.
I'm quite attracted to it.
It says nothing about how data is stored; it is about roles and
abstract access to physical instance data. And it separates the logical
model (the data set definition used by ML algorithms) from the physical
model (the data set definition describing the source data), allowing one
to virtually transform the data set by mapping logical data to the
physical data in any way without messing things up.
I now have a half-baked pseudo implementation of this. It uses
abstract classes rather than interfaces, and some of the interfaces
have been merged into a single class. It would however not be a big deal
to have it implement the interfaces if one wishes. I feel some of the
stuff in there is a bit overkill at this point, but I tried to follow
the specs as well as I could (I replaced a few ad hoc enum classes
with enums, etc).
There is no documentation, tests or anything concrete, just a bunch of
classes I'm now popping in the JIRA to show what it could look like.
Actually, there is an early attempt at an abstract seekable physical
data record reader. And an ARFF writer. They are sort of my dry coded
thoughts. You can ignore them.
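For readers who have not seen ARFF: it is the plain-text format used by Weka, with an @relation line, one @attribute line per column and an @data section of comma-separated rows. A minimal writer could look something like the sketch below. It is not the writer in the attachment; ArffSketch and toArff are hypothetical names, and quoting of values that contain spaces or commas is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

public class ArffSketch {

    /**
     * Renders a data set as ARFF text.
     * Each attribute declaration is "name type", e.g. "score numeric"
     * or "label {spam,ham}". Values are written as-is, without quoting.
     */
    static String toArff(String relation, List<String> attributes, List<Object[]> rows) {
        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append('\n');
        for (String attribute : attributes) {
            out.append("@attribute ").append(attribute).append('\n');
        }
        out.append("@data\n");
        for (Object[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(row[i]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> attributes = new ArrayList<>();
        attributes.add("score numeric");
        attributes.add("label {spam,ham}");

        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { 7, "spam" });

        System.out.print(toArff("mail", attributes, rows));
    }
}
```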
Data definition model
---------------------
Key: MAHOUT-8
URL: https://issues.apache.org/jira/browse/MAHOUT-8
Project: Mahout
Issue Type: New Feature
Reporter: Karl Wettin
Attachments: pseudo_jsr.txt
How do we define classes, attributes and instance data?
This has nothing to do with physical data records, this is about data
types, roles, etc.
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ