Hi,

I think we should restart this conversation.
Matthieu, do you think we can review the branch? Or do you want to make any updates to it first?
Cheers,
-- Gianmarco

On 26 January 2015 at 16:20, Albert Bifet <[email protected]> wrote:

> Hi Matthieu,
>
> Thanks for your answers! I agree with using double values to store
> attribute information. I think we need to define how to maintain the
> mapping, as some learners need to know if attributes are discrete or
> numeric, in order to learn and do predictions, and how many values the
> discrete attributes have.
>
> Cheers, Albert
>
> On Mon, Jan 26, 2015 at 7:33 PM, Matthieu Morel <[email protected]> wrote:
>
> > - discrete attributes are eventually mapped to double values, and
> > that's the appropriate input to instances, in my understanding. My
> > idea was to maintain the mapping in the feature extraction step, and
> > share it in some way with the processing topology.
> >
> > - regarding performance in sparse instances, I haven't done any sort
> > of benchmark yet. The implementation can be changed while keeping the
> > same API.
> > From what I see, on the one hand, in the current approach using an
> > index array, we have the extra constraints that 1/ this index array
> > must be sorted (adds building time), and 2/ we have to do a binary
> > search for the index value (log(n)).
> > On the other hand, there are some very efficient map implementations
> > that we could reuse. For example, CERN's colt package, actually
> > already imported in the mahout-collections ASF package.
> >
> > I hope this answers your questions,
> >
> > Matthieu
> >
> >
> > On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <[email protected]> wrote:
> > > Nice and simple API! Some things to comment:
> > >
> > > - how can we manage discrete attributes, for example attribute class:
> > > "+","-"?
> > >
> > > - In sparse instances, is the performance of a map similar to the
> > > performance of two arrays, one for indices and one for values?
> > >
> > > Albert
> > >
> > > On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <[email protected]>
> > > wrote:
> > >
> > >> I took a shot at drafting a simplified API for instances.
> > >> https://github.com/matthieumorel/samoa/tree/new-instances
> > >>
> > >> As pointed out in this thread, the current API is too exhaustive, too
> > >> tied to a specific implementation, and too tied to WEKA/MOA APIs.
> > >>
> > >> In addition, I feel the header/information does not belong to the
> > >> instance. This is something which is used when parsing arff files
> > >> where static information about the stream is available from the start.
> > >> But for a real streaming use case, we should not make such an assumption.
> > >> Whatever is known at the beginning should be loaded by the topology,
> > >> but not included in the instances. Arff files can still be loaded and
> > >> generate instances in the new format. Only the headers should be
> > >> parsed separately.
> > >>
> > >> This proposal is a draft and single label only. It should be easy to
> > >> add functionality suggested by Albert for multi labels.
> > >>
> > >> Feel free to comment!
> > >>
> > >> Matthieu
> > >>
> > >>
> > >> On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <[email protected]>
> > >> wrote:
> > >> > 1/ Learners such as decision trees can deal with new instances that
> > >> > arrive with more label classes. New instances can arrive with new
> > >> > headers.
> > >> >
> > >> > 2/ To change class labels dynamically, we need to add a method
> > >> > "setValue(int, string)" in the Attribute class to dynamically add
> > >> > new attribute values.
> > >> >
> > >> > 3/ The current design is compatible with the methods in weka
> > >> > instances. It could be nice to have a fresher design. I will need
> > >> > some help to have a simplified and fresher design of the instances
> > >> > as I'm a bit conditioned by the previous instance usage :)
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Albert
> > >> >
> > >> >
> > >> > On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
> > >> > <[email protected]> wrote:
> > >> >> Hey Matthieu,
> > >> >>
> > >> >>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <[email protected]>
> > >> >>> wrote:
> > >> >>>
> > >> >>> I'm confused. From what I see the number of classes is currently
> > >> >>> fixed in the instance header. As if it was static. I suppose you
> > >> >>> can work around that limitation with some hacks but I want to use
> > >> >>> a clean API for that.
> > >> >>>
> > >> >>> Or is there a recommended way I'm missing?
> > >> >>
> > >> >> Ah, I think I remember now what happened. As far as I encountered
> > >> >> this situation, the data had, say, an .arff format with a header
> > >> >> stating the number of class values, and the instance header was
> > >> >> read from that, while the instances were then read line by line.
> > >> >>
> > >> >> I worked around that by just storing the class label seen in the
> > >> >> instances on the fly when building a model, and ignored that field
> > >> >> of the instance header. Sorry for the confusion!
> > >> >>
> > >> >> Cheers,
> > >> >> Olivier
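
For reference, the two technical points in the quoted discussion (mapping discrete attribute values to doubles outside the instance, and looking up values in a sparse instance through a sorted index array versus a map) can be sketched roughly as below. This is only an illustration in Python, not the SAMOA API and not the code in the new-instances branch: the class names NominalMapping, ArraySparseInstance and MapSparseInstance are hypothetical, and returning 0.0 for an attribute that is not stored is an assumption. It shows the lookup logic only and says nothing about the relative performance of the two layouts on the JVM, which would still need a benchmark.

```python
from bisect import bisect_left


class NominalMapping:
    """Hypothetical helper: maps the labels of one discrete attribute to doubles.

    It lives outside the instance (for example in the feature extraction
    step) and grows when an unseen label such as a new class value arrives.
    """

    def __init__(self):
        self._index_of = {}  # label -> position
        self._labels = []    # position -> label

    def encode(self, label):
        """Return the double for `label`, registering unseen labels on the fly."""
        if label not in self._index_of:
            self._index_of[label] = len(self._labels)
            self._labels.append(label)
        return float(self._index_of[label])

    def decode(self, value):
        """Return the label that was encoded as `value`."""
        return self._labels[int(value)]

    def num_values(self):
        """Number of distinct labels seen so far."""
        return len(self._labels)


class ArraySparseInstance:
    """Sparse instance stored as two parallel arrays with sorted indices."""

    def __init__(self, entries):
        pairs = sorted(entries.items())  # sorting cost is paid at build time
        self._indices = [i for i, _ in pairs]
        self._values = [float(v) for _, v in pairs]

    def value(self, attribute_index):
        # Binary search over the sorted index array: O(log n) per lookup.
        pos = bisect_left(self._indices, attribute_index)
        if pos < len(self._indices) and self._indices[pos] == attribute_index:
            return self._values[pos]
        return 0.0  # assumption: attributes that are not stored read as 0.0


class MapSparseInstance:
    """Sparse instance backed by a hash map; same lookup API as above."""

    def __init__(self, entries):
        self._values = {i: float(v) for i, v in entries.items()}

    def value(self, attribute_index):
        # Hash lookup: no sorting needed at build time.
        return self._values.get(attribute_index, 0.0)


if __name__ == "__main__":
    class_attribute = NominalMapping()
    plus = class_attribute.encode("+")
    minus = class_attribute.encode("-")
    print(plus, minus, class_attribute.num_values())

    entries = {40: 2.0, 3: 1.5}
    for instance in (ArraySparseInstance(entries), MapSparseInstance(entries)):
        print(type(instance).__name__, instance.value(3), instance.value(41))
```

Both instance classes expose the same value(attribute_index) method, which matches the point made in the thread that the storage can be changed while keeping the same API. A learner that needs to know how many values a discrete attribute has would ask the shared mapping rather than the instance.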
