On 31 Jan 2008 at 23:37, Steve Rowe wrote:

Karl, can you elaborate on what you think is wrong with Weka's instances implementation?


I used the word bloated, but what I really meant is that it was written to fit everything. I have never heard of anyone actually doing it, but Instances could potentially be extended and tailored to fit. As it stands, it is quite a static solution:

                                   <<creates>>
[Instances]<#>------>[FastVector]<- - - - - -[ARFFReader]

FastVector actually loads all data into memory. It's pretty fast, but not optimal for all environments and data sets. I would rather see file persistence combined with some transparent cache and index.

I think Ted hits the nail on the head when he writes:

The way that Colt does (most of) this is to use a higher-order API. Most users I have talked to were completely confused by this. I think the right answer is to require a small set of primitives for each implementation and inherit nice API much like AbstractMap provides lots of sugar over a spartan Map implementation.
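
To make Ted's point concrete, here is a minimal sketch of the AbstractMap pattern applied to a vector type. Every name in it (AbstractVector, DenseVector, dot, norm) is made up for illustration and is not an API from Weka, Colt, or any other library. An implementation supplies two primitives, and the convenience methods come along for free.

```java
// Hypothetical names throughout; not an API from Weka, Colt, or any other library.
abstract class AbstractVector {
    // The only two primitives an implementation has to supply.
    abstract int size();
    abstract double get(int index);

    // Convenience methods written purely against the primitives,
    // so every implementation inherits them unchanged.
    double dot(AbstractVector other) {
        if (other.size() != size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < size(); i++) {
            sum += get(i) * other.get(i);
        }
        return sum;
    }

    double norm() {
        return Math.sqrt(dot(this));
    }
}

// A spartan array-backed implementation: just the two primitives.
final class DenseVector extends AbstractVector {
    private final double[] values;

    DenseVector(double... values) {
        this.values = values.clone();
    }

    @Override
    int size() {
        return values.length;
    }

    @Override
    double get(int index) {
        return values[index];
    }
}
```

A sparse or file-backed vector would override the same two methods, and could additionally override dot() if it knows a faster way.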

Which is pretty much what I meant when I wrote:

We should talk about a unified data access API. No need for something fancy or speedy from the start, a seekable record reader might be enough for now. Lots of abstract layers to allow people to add support methods and use any data source, with optional levels of access optimization. An ARFF, an inverted index or whatever fits best with the algorithm you are about to pass the data to.
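
A rough sketch of what such a seekable record reader could look like, kept on disk rather than loaded into memory. The interface and class names (SeekableRecordReader, LineRecordReader) are hypothetical, and the one-record-per-line ASCII text format is an assumption made only to keep the example short. The reader scans the file once to build an in-memory index of record offsets and then seeks on demand.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: the minimal contract a data source would implement.
interface SeekableRecordReader<R> extends AutoCloseable {
    long recordCount();
    void seek(long recordNumber) throws IOException;
    R next() throws IOException; // returns null when no records remain
    @Override
    void close() throws IOException;
}

// Assumes one record per line in an ASCII text file.
final class LineRecordReader implements SeekableRecordReader<String> {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>(); // byte offset of each record
    private long current = 0;

    LineRecordReader(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long offset = 0;
        // One pass over the file to index where every record starts.
        while (file.readLine() != null) {
            offsets.add(offset);
            offset = file.getFilePointer();
        }
        file.seek(0);
    }

    @Override
    public long recordCount() {
        return offsets.size();
    }

    @Override
    public void seek(long recordNumber) throws IOException {
        if (recordNumber < 0 || recordNumber > offsets.size()) {
            throw new IndexOutOfBoundsException("record " + recordNumber);
        }
        if (recordNumber < offsets.size()) {
            file.seek(offsets.get((int) recordNumber));
        }
        current = recordNumber;
    }

    @Override
    public String next() throws IOException {
        if (current >= offsets.size()) {
            return null;
        }
        current++;
        return file.readLine();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

An ARFF parser, a cache or an inverted index could then be layered on top of the same interface without the algorithm having to know which one it is talking to.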

In some cases a direct link to the data source can make sense. All I need for Bayesian classification is the class feature frequency. That could, for instance, be pulled straight out of a Lucene index reader, like this: <http://issues.apache.org/jira/browse/LUCENE-1039>.
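
To illustrate why frequencies are enough, here is a small sketch of a multinomial naive Bayes classifier that only ever talks to a frequency source. The FrequencySource interface and the in-memory implementation are hypothetical stand-ins, and the add-one smoothing is my own choice for the example. This is not the code from the LUCENE-1039 patch, and no Lucene API is used; an index-backed implementation would just answer the same five questions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical interface: everything the classifier needs from a data source.
interface FrequencySource {
    Set<String> labels();
    long documentCount(String label);                 // documents in the class
    long featureCount(String label, String feature);  // occurrences of a feature in the class
    long totalFeatureCount(String label);             // all feature occurrences in the class
    int vocabularySize();                             // distinct features overall
}

// In-memory stand-in for an index-backed source.
final class InMemoryFrequencies implements FrequencySource {
    private final Map<String, Long> docs = new HashMap<>();
    private final Map<String, Map<String, Long>> counts = new HashMap<>();
    private final Map<String, Long> totals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    void add(String label, List<String> features) {
        docs.merge(label, 1L, Long::sum);
        Map<String, Long> perLabel = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String feature : features) {
            perLabel.merge(feature, 1L, Long::sum);
            totals.merge(label, 1L, Long::sum);
            vocabulary.add(feature);
        }
    }

    public Set<String> labels() { return docs.keySet(); }
    public long documentCount(String label) { return docs.getOrDefault(label, 0L); }
    public long featureCount(String label, String feature) {
        Map<String, Long> perLabel = counts.get(label);
        return perLabel == null ? 0L : perLabel.getOrDefault(feature, 0L);
    }
    public long totalFeatureCount(String label) { return totals.getOrDefault(label, 0L); }
    public int vocabularySize() { return vocabulary.size(); }
}

final class NaiveBayes {
    // Returns the label with the highest log-probability, using add-one smoothing.
    static String classify(FrequencySource source, List<String> features) {
        long totalDocs = 0;
        for (String label : source.labels()) {
            totalDocs += source.documentCount(label);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : source.labels()) {
            double score = Math.log((double) source.documentCount(label) / totalDocs);
            double denominator = source.totalFeatureCount(label) + source.vocabularySize();
            for (String feature : features) {
                score += Math.log((source.featureCount(label, feature) + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

The classifier never sees a document or an instance object, only counts, which is why a direct link to the index can replace the whole loading step.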



   karl
