Re: (De-)serializing collections/datasets

Karl Wettin Fri, 01 Feb 2008 04:04:56 -0800


On 31 Jan 2008, at 23:37, Steve Rowe wrote:

Karl, can you elaborate on what you think is wrong with Weka's Instances implementation?

I used the word bloated, but really meant that it was written to fit everything. I have never heard of anyone actually doing it, but Instances could potentially be extended and tailored. It is quite a static solution:


                                   <<creates>>
[Instances]<#>------>[FastVector]<- - - - - -[ARFFReader]

FastVector actually loads all data into memory. It's pretty fast, but not optimal for all environments and data sets. I would rather see file persistence combined with some transparent cache and index.
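
To make that a bit more concrete, here is a rough sketch of what I have in mind. It assumes fixed-width records of doubles in a random access file, with a small LRU cache in front so callers never see whether a record came from disk or memory. The class CachedRecordFile and all of its methods are made up for illustration; nothing here is Weka API, and the index part is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical file-backed record store with a small LRU cache (not part of Weka). */
class CachedRecordFile implements AutoCloseable {
    private final RandomAccessFile file;
    private final int numAttributes;
    private final Map<Long, double[]> cache;
    private long diskReads = 0;

    CachedRecordFile(String path, int numAttributes, final int cacheSize) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.numAttributes = numAttributes;
        // An access-ordered LinkedHashMap drops the least recently used record
        // once the cache holds more than cacheSize entries.
        this.cache = new LinkedHashMap<Long, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, double[]> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Each record is numAttributes doubles of 8 bytes each. */
    private long recordBytes() {
        return 8L * numAttributes;
    }

    long numRecords() throws IOException {
        return file.length() / recordBytes();
    }

    /** Appends one record at the end of the file. */
    void append(double[] values) throws IOException {
        if (values.length != numAttributes) {
            throw new IllegalArgumentException("Expected " + numAttributes + " values");
        }
        file.seek(file.length());
        for (double v : values) {
            file.writeDouble(v);
        }
    }

    /** Returns the record at index, from the cache if possible, otherwise from disk. */
    double[] get(long index) throws IOException {
        double[] cached = cache.get(index);
        if (cached != null) {
            return cached;
        }
        if (index < 0 || index >= numRecords()) {
            throw new IndexOutOfBoundsException("No record " + index);
        }
        file.seek(index * recordBytes());
        double[] values = new double[numAttributes];
        for (int i = 0; i < numAttributes; i++) {
            values[i] = file.readDouble();
        }
        diskReads++;
        cache.put(index, values);
        return values;
    }

    /** Number of reads that actually hit the file, useful for checking the cache. */
    long diskReads() {
        return diskReads;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The cache size then becomes the knob for trading memory against disk access, instead of the all-in-memory choice being hard-wired.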


I think Ted hits the nail on the head when he writes:

The way that Colt does (most of) this is to use a higher-order API. Most users I have talked to were completely confused by this. I think the right answer is to require a small set of primitives for each implementation and inherit nice API much like AbstractMap provides lots of sugar over a spartan Map implementation.


Which is pretty much what I meant when I wrote:

We should talk about a unified data access API. No need for something fancy or speedy from the start; a seekable record reader might be enough for now. Lots of abstract layers, so people can add support methods and use any data source with optional levels of access optimization. An ARFF file, an inverted index, or whatever fits best with the algorithm you are about to pass the data to.
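
As a rough sketch of how the two ideas could fit together: a data source supplies only two primitives (size and read by index), and an abstract class layers the convenience methods on top, the way AbstractMap does for Map. All names here (SeekableRecordReader, AbstractRecordReader, InMemoryRecordReader) are hypothetical and only meant to start the discussion, not a proposed final API.

```java
/** Hypothetical minimal contract: the two primitives every data source must supply. */
interface SeekableRecordReader {
    /** Number of records available. */
    long size();

    /** Returns the attribute values of the record at the given position. */
    double[] read(long index);
}

/** Skeletal class that derives the sugar from the two primitives. */
abstract class AbstractRecordReader implements SeekableRecordReader {

    /** Values of one attribute across all records (assumes size() fits in an int). */
    public double[] column(int attribute) {
        double[] out = new double[(int) size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = read(i)[attribute];
        }
        return out;
    }

    /** Mean of one attribute, or NaN if there are no records. */
    public double mean(int attribute) {
        long n = size();
        if (n == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            sum += read(i)[attribute];
        }
        return sum / n;
    }
}

/** Simplest possible source: records held in an array. */
class InMemoryRecordReader extends AbstractRecordReader {
    private final double[][] records;

    InMemoryRecordReader(double[][] records) {
        this.records = records;
    }

    @Override
    public long size() {
        return records.length;
    }

    @Override
    public double[] read(long index) {
        return records[(int) index];
    }
}
```

An ARFF-backed or index-backed reader would implement the same two methods and could override column or mean where the source allows something faster, which is where the optional levels of access optimization would come in.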

In some cases a direct link to the data source can make sense. All I need for Bayesian classification is the class feature frequency. That could, for instance, be pulled straight out of a Lucene index reader, like this: <http://issues.apache.org/jira/browse/LUCENE-1039>.
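
To illustrate the point about frequencies being enough, here is a small sketch of a classifier that only ever asks its data source for counts. The FrequencySource interface is hypothetical; an implementation backed by a Lucene index reader is the idea, but it is not shown here and this is not the code from the issue above. The in-memory implementation exists only so the sketch runs. The scoring is a simplified naive Bayes with add-one smoothing that looks only at the features present in the input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical view of the only statistics the classifier needs. */
interface FrequencySource {
    int totalDocuments();
    int classCount(String label);                   // documents with this class
    int featureCount(String label, String feature); // documents of this class containing the feature
    Iterable<String> labels();
}

/** Stand-in source that counts in memory; an index-backed one would answer the same questions. */
class InMemoryFrequencySource implements FrequencySource {
    private final Map<String, Integer> classCounts = new LinkedHashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    private int total = 0;

    void add(String label, String... features) {
        total++;
        classCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> perClass = featureCounts.computeIfAbsent(label, k -> new HashMap<>());
        // Count each feature once per document (presence, not term frequency).
        for (String feature : new HashSet<>(Arrays.asList(features))) {
            perClass.merge(feature, 1, Integer::sum);
        }
    }

    public int totalDocuments() { return total; }

    public int classCount(String label) { return classCounts.getOrDefault(label, 0); }

    public int featureCount(String label, String feature) {
        Map<String, Integer> perClass = featureCounts.get(label);
        return perClass == null ? 0 : perClass.getOrDefault(feature, 0);
    }

    public Iterable<String> labels() { return classCounts.keySet(); }
}

/** Simplified naive Bayes that works purely from class and class-feature frequencies. */
class FrequencyNaiveBayes {
    private final FrequencySource source;

    FrequencyNaiveBayes(FrequencySource source) {
        this.source = source;
    }

    /** log P(class) plus the sum of log P(feature | class), with add-one smoothing. */
    double logScore(String label, Iterable<String> features) {
        int classDocs = source.classCount(label);
        double score = Math.log((double) classDocs / source.totalDocuments());
        for (String feature : features) {
            score += Math.log((source.featureCount(label, feature) + 1.0) / (classDocs + 2.0));
        }
        return score;
    }

    /** Returns the label with the highest score, or null if the source has no labels. */
    String classify(Iterable<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : source.labels()) {
            double score = logScore(label, features);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

The classifier never touches an instance, only counts, so no Instances-style container has to be loaded at all.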




   karl
