On 31 Jan 2008, at 23:37, Steve Rowe wrote:
Karl, can you elaborate on what you think is wrong with Weka's
instances implementation?
I used the word bloated, but really meant that it was written to fit
everything. I have never heard of anyone actually doing it, but Instances
could potentially be extended and tailored. It is quite a
static solution:
<<creates>>
[Instances]<#>------>[FastVector]<- - - - - -[ARFFReader]
FastVector actually loads all data into memory. It's pretty fast, but not
optimal for all environments and data sets. I would rather see
file persistence combined with some transparent cache and index.
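
A minimal sketch of what "file persistence with a transparent cache and
index" could look like, in Java. Everything here is hypothetical and not
part of Weka: RecordStore, LineFileStore and CachingStore are names made up
for illustration, and records are assumed to be plain ASCII lines in a text
file. Only the byte offset of each line is kept in memory; an LRU cache sits
in front of the file so callers never see the difference.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: random access to records by position.
interface RecordStore {
    String read(int index) throws IOException;
    int size();
}

// Records stay on disk; only the start offset of each line is held in memory.
// Assumes ASCII content, since RandomAccessFile.readLine maps bytes to chars.
class LineFileStore implements RecordStore, AutoCloseable {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>();

    LineFileStore(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long start = 0;
        // One pass over the file to build the offset index.
        while (file.readLine() != null) {
            offsets.add(start);
            start = file.getFilePointer();
        }
    }

    @Override
    public String read(int index) throws IOException {
        file.seek(offsets.get(index));
        return file.readLine();
    }

    @Override
    public int size() { return offsets.size(); }

    @Override
    public void close() throws IOException { file.close(); }
}

// Transparent LRU cache in front of any RecordStore.
class CachingStore implements RecordStore {
    private final RecordStore backing;
    private final Map<Integer, String> cache;

    CachingStore(RecordStore backing, final int capacity) {
        this.backing = backing;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > capacity;
            }
        };
    }

    @Override
    public String read(int index) throws IOException {
        String record = cache.get(index);
        if (record == null) {
            record = backing.read(index);   // cache miss: go to disk
            cache.put(index, record);
        }
        return record;
    }

    @Override
    public int size() { return backing.size(); }
}
```

Usage would be something like new CachingStore(new LineFileStore(path), 1000);
an algorithm that only sees RecordStore does not need to know whether a
record came from memory or from disk.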
I think Ted hits the nail on the head when he writes:
The way that Colt does (most of) this is to use a higher-order API. Most
users I have talked to were completely confused by this. I think the right
answer is to require a small set of primitives for each implementation and
inherit nice API much like AbstractMap provides lots of sugar over a spartan
Map implementation.
Which is pretty much what I meant when I wrote:
We should talk about a unified data access API. No need for
something fancy or speedy from the start; a seekable record reader
might be enough for now. Lots of abstract layers to allow people
to add support methods and use any data source with optional
levels of access optimization: an ARFF, an inverted index or whatever
fits best with the algorithm you are about to pass the data to.
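
To make the "small set of primitives plus inherited sugar" idea concrete for
a seekable record reader, here is a rough Java sketch. The class and method
names (SeekableRecordReader, ListRecordReader, seek, next, range, readAll)
are invented for this example and are not taken from Weka, Colt or any
existing API. A data source implements three primitives; the convenience
methods come for free from the base class, much like AbstractMap builds on
entrySet().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: three primitives per data source, the rest inherited.
abstract class SeekableRecordReader<R> {
    // --- primitives every implementation must supply ---
    abstract long size();
    abstract void seek(long position);
    abstract R next();   // returns the record at the current position, then advances

    // --- convenience methods built only on the primitives ---
    R get(long position) {
        seek(position);
        return next();
    }

    // Records in the half-open range [from, to).
    List<R> range(long from, long to) {
        List<R> out = new ArrayList<>();
        seek(from);
        for (long i = from; i < to; i++) {
            out.add(next());
        }
        return out;
    }

    List<R> readAll() {
        return range(0, size());
    }
}

// Simplest possible source: records already held in a list.
class ListRecordReader<R> extends SeekableRecordReader<R> {
    private final List<R> records;
    private int position = 0;

    ListRecordReader(List<R> records) {
        this.records = records;
    }

    @Override
    long size() { return records.size(); }

    @Override
    void seek(long position) {
        if (position < 0 || position > records.size()) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        this.position = (int) position;
    }

    @Override
    R next() {
        if (position >= records.size()) {
            throw new NoSuchElementException();
        }
        return records.get(position++);
    }
}
```

An ARFF-backed or index-backed reader would override the same three methods,
and could also override get or range where the source offers a faster path.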
In some cases a direct link to the data source can make sense. All I
need for Bayesian classification is the class feature frequency.
That could for instance be pulled straight out of a Lucene index
reader, like this: <http://issues.apache.org/jira/browse/LUCENE-1039>.
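
As a sketch of why the frequencies are enough: the classifier below only
ever asks its data source for two counts. FrequencySource,
InMemoryFrequencies and NaiveBayes are hypothetical names for illustration;
the in-memory implementation stands in for whatever would read the counts
from an index, and the Lucene wiring itself is not shown. The scoring is a
simplified naive Bayes with add-one smoothing that only looks at the
features present in the input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical interface: the only data access the classifier needs.
interface FrequencySource {
    Set<String> classes();
    int classFrequency(String cls);                    // documents labelled cls
    int featureFrequency(String cls, String feature);  // of those, how many contain feature
}

// Stand-in data source that keeps the counts in maps.
class InMemoryFrequencies implements FrequencySource {
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();

    void add(String cls, String... features) {
        classCounts.merge(cls, 1, Integer::sum);
        Map<String, Integer> counts = featureCounts.computeIfAbsent(cls, k -> new HashMap<>());
        // Count each distinct feature once per document.
        for (String f : new HashSet<>(Arrays.asList(features))) {
            counts.merge(f, 1, Integer::sum);
        }
    }

    @Override
    public Set<String> classes() { return classCounts.keySet(); }

    @Override
    public int classFrequency(String cls) {
        Integer n = classCounts.get(cls);
        return n == null ? 0 : n;
    }

    @Override
    public int featureFrequency(String cls, String feature) {
        Map<String, Integer> counts = featureCounts.get(cls);
        if (counts == null) return 0;
        Integer n = counts.get(feature);
        return n == null ? 0 : n;
    }
}

class NaiveBayes {
    // Returns the class with the highest log score, or null if there are no classes.
    static String classify(FrequencySource src, List<String> features) {
        int total = 0;
        for (String c : src.classes()) total += src.classFrequency(c);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : src.classes()) {
            int n = src.classFrequency(c);
            double score = Math.log((double) n / total);   // log prior
            for (String f : features) {
                // Add-one smoothed estimate of P(feature present | class).
                score += Math.log((src.featureFrequency(c, f) + 1.0) / (n + 2.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

Swapping the in-memory counts for an index-backed FrequencySource would not
touch the classifier at all, which is the point of linking directly to the
data source.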
karl