Hi Shekar,

I think the way to start is to define how the instance will be serialized in JSON.
To do so, we need to answer a few questions:

- How are the attribute IDs represented? Probably a simple int as the key is enough.
- How do we represent metadata? I'd say that a single JSON instance at the beginning of the file could contain the needed metadata. For example, for each attribute we could have its domain (binary, nominal, real, etc.). For very large datasets this might be inefficient, so we might want to have a default (real) and a way to express ranges (e.g., attributes from 0 to 10000 are all real).
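
As an illustration of the two questions above, one possible layout is sketched here in Python. Everything in it is an assumption made for the example, not an agreed SAMOA format: the one-JSON-object-per-line layout, the header field names (`default`, `ranges`, `attributes`), and the helper functions are all hypothetical. One detail that is certain: JSON object keys are always strings, so integer attribute IDs would be written as "0", "17", and so on, and converted back to int by the reader.

```python
import json

# Hypothetical metadata header: the first JSON object in the file.
# Field names ("default", "ranges", "attributes") are illustrative only.
HEADER = {
    "default": "real",
    "ranges": [{"from": 0, "to": 10000, "type": "real"}],
    "attributes": {
        "10001": {"type": "nominal", "values": ["a", "b"]},
        "10002": {"type": "binary"},
    },
}


def attribute_type(header, attr_id):
    """Resolve an attribute's domain: explicit entry, then ranges, then default."""
    explicit = header.get("attributes", {}).get(str(attr_id))
    if explicit is not None:
        return explicit["type"]
    for r in header.get("ranges", []):
        if r["from"] <= attr_id <= r["to"]:  # ranges are inclusive in this sketch
            return r["type"]
    return header.get("default", "real")


def parse_instance(line):
    """Parse one instance line into {int attribute ID: value}."""
    # JSON keys are strings, so the integer IDs are recovered with int().
    return {int(key): value for key, value in json.loads(line).items()}


def read_stream(lines):
    """Treat the first non-empty line as the header, the rest as instances."""
    non_empty = (line for line in lines if line.strip())
    header = json.loads(next(non_empty))
    return header, [parse_instance(line) for line in non_empty]
```

With this layout, an instance line such as `{"0": 1.5, "10001": "a"}` only lists the attributes that are present, and the domain of any attribute not described explicitly falls back to the ranges and then to the default.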

There might be other issues that I am overlooking now, but in practice we need a 1:1 mapping from SAMOA instances to JSON.
Once this is set, implementing a reader should be straightforward.

The best way to start, imho, is to create a document where the format is described in all its details.
See, e.g., https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format for VW.
A simple Google doc would be good to start.

Hope this helps!

--
Gianmarco

On 2 September 2015 at 09:31, Shekar Tippur <[email protected]> wrote:

> Gianmarco,
>
> I really want to take up Samoa supporting json. Can you please point me to
> somewhere I can start?
>
> - Shekar
>
> On Sun, Jul 12, 2015 at 12:20 AM, Gianmarco De Francisci Morales
> <[email protected]> wrote:
>
> > Hi,
> >
> > The only reason is that we inherited the format from MOA.
> > In practice, anything from which we can create an Instance from would be
> > good enough.
> > For example I'd like to support VW and svmLib formats.
> >
> > One caveat is that some algorithms require knowledge of the metadata for
> > the datasets to preallocate some data structure.
> > I would like to remove this dependency in the future, by having the
> > algorithms completely adaptable.
> > Though it's not as easy as it sounds :)
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> > On 11 July 2015 at 16:46, Shekar Tippur <[email protected]> wrote:
> >
> > > Gianmarco
> > >
> > > Thanks for the response. Can you please specify the format? Can you
> > > please explain the reason for keeping it in a specific format?
> > > I would like contribute to kafka enhancement. I will look into the code
> > > base you pointed out.
> > >
> > > Shekar
> > >
> > > On Jul 11, 2015 1:36 AM, "Gianmarco De Francisci Morales"
> > > <[email protected]> wrote:
> > >
> > > > Hi Shekar,
> > > >
> > > > At the moment we do not support JSON data.
> > > > The current readers support ARFF format, which is a CSV with some
> > > > header.
> > > > http://www.cs.waikato.ac.nz/ml/weka/arff.html
> > > > Adding support for JSON is doable, but it should conform to a very
> > > > specific format.
> > > >
> > > > About Kafka, we support it as a transport via Samza, but we don't
> > > > have a reader for it right now.
> > > > Adding it would be very valuable. If you wanted to work on it I'd be
> > > > happy to help.
> > > > Have a look at org.apache.samoa.streams.fs.HDFSFileStreamSource,
> > > > and org.apache.samoa.streams.ArffFileStream for some examples.
> > > >
> > > > Cheers,
> > > >
> > > > --
> > > > Gianmarco
> > > >
> > > > On 10 July 2015 at 01:18, Shekar Tippur <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am trying to use Samoa/Samza combination to apply ML for a
> > > > > dataset I have in JSON format.
> > > > >
> > > > > This is the document I am following:
> > > > >
> > > > > https://samoa.incubator.apache.org/documentation/Executing-SAMOA-with-Apache-Samza.html
> > > > >
> > > > > Couple of questions:
> > > > > 1. How do I point the input event to a Stream/Topic in Kafka? The
> > > > > data is in JSON.
> > > > > 2. If I want to use historical data that is stored in a file, how
> > > > > do I point the job to read from a file and serialise as json?
> > > > >
> > > > > bin/samoa samza target/SAMOA-Samza-0.3.0-SNAPSHOT.jar
> > > > > "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (??)"
> > > > >
> > > > > - Shekar
