Thanks Jay,

I updated the wiki to include the links (https://cwiki.apache.org/confluence/display/SAMOA/Sample+Avro+Datasets). Unfortunately, the files are too big to attach directly in the wiki.
Cheers,

--
Gianmarco

On 25 December 2015 at 16:48, Jayadeep J <[email protected]> wrote:

> Hi Gianmarco,
>
> I have created a PR with the documentation for website docs.
>
> The zipped test data-sets are 2 files of 20 MB each for JSON & Binary. If
> you can attach it in wiki, then that is great. I don't have access to
> create a wiki page I guess. The links to download the files are below
>
> https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
>
> https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
>
> Thanks
> Jay
>
> On Mon, Nov 30, 2015 at 9:09 PM, Gianmarco De Francisci Morales <
> [email protected]> wrote:
>
> > Thanks Jayadeep,
> >
> > I think the docs could go in the website docs.
> > Not sure about the test datasets. Maybe as attachments in the wiki if they
> > are not too big?
> >
> > --
> > Gianmarco
> >
> > On 30 November 2015 at 14:38, Jayadeep J <[email protected]> wrote:
> >
> > > Hi Gianmarco,
> > >
> > > I have closed the PR
> > >
> > > Let me know where to put the instructions for using AVRO, Input format
> > > document & test data sets ???
> > >
> > > https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
> > >
> > > https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
> > >
> > > https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
> > >
> > > Thanks
> > > Jay
> > > https://github.com/jayadeepj
> > >
> > > On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J <[email protected]> wrote:
> > >
> > > > Hi Gianmarco,
> > > >
> > > > All the test instructions, test data & other details are updated on the
> > > > pull request
> > > >
> > > > Thanks
> > > > Jay
> > > > https://github.com/jayadeepj
> > > >
> > > > On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
> > > > [email protected]> wrote:
> > > >
> > > >> Thanks Jay,
> > > >>
> > > >> I'll test it this weekend. Do you have some instructions and data I could
> > > >> use to try it out?
> > > >>
> > > >> --
> > > >> Gianmarco
> > > >>
> > > >> On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:
> > > >>
> > > >> > Hi Gianmarco,
> > > >> >
> > > >> > I have implemented this functionality as per the suggestions and have
> > > >> > raised a pull request.
> > > >> >
> > > >> > The implementation details are as below.
> > > >> >
> > > >> > 1) A new AvroFileStream as a subclass of existing FileStream that will take
> > > >> > in the encoding format (json/binary) from command-line. It will use
> > > >> > InputStream instead of current io Reader to handle Binary Streams.
> > > >> > 2) A common Loader interface to make the parsing of streams generic rather
> > > >> > than only ARFF
> > > >> > 3) A new AvroLoader abstract class in samoa-instances that will handle the
> > > >> > parsing of the Avro Generic Records from InputStream into SAMOA instances.
> > > >> > If even one attribute in the Avro schema has a null union (nullable
> > > >> > attribute) then it will be converted into a SAMOA Sparse Instance else
> > > >> > DenseInstance
> > > >> > 4) Two sub-classes of AvroLoader for Binary & JSON parsing i.e.
> > > >> > AvroJsonLoader & AvroBinaryLoader. Both will set the meta-data & Avro
> > > >> > schema on initialization. They will use separate decoders to read from the
> > > >> > stream
> > > >> > 5) Appropriate changes in poms, Instances.java & ARFFLoader to use the new
> > > >> > Loader interface
> > > >> >
> > > >> > Though I have seen that the Travis build has failed. Couldn't see from the
> > > >> > logs if it is due to this code change
> > > >> >
> > > >> > Thanks
> > > >> > Jay
> > > >> > https://github.com/jayadeepj
> > > >> >
> > > >> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> > > >> > [email protected]> wrote:
> > > >> >
> > > >> > > Hi Jay,
> > > >> > >
> > > >> > > 1) I agree custom data types would be overkill.
> > > >> > > I was thinking of the second option you mentioned, distinguishing it
> > > >> > > inside the code.
> > > >> > > So the parser code would expect either all values to be optional, or all
> > > >> > > values to be required.
> > > >> > >
> > > >> > > I think the plan you have in mind is quite reasonable.
> > > >> > > I don't have other suggestions right now.
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > --
> > > >> > > Gianmarco
> > > >> > >
> > > >> > > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
> > > >> > >
> > > >> > >> Hi Gianmarco,
> > > >> > >>
> > > >> > >> Thanks for your reply. Regarding the points you mentioned,
> > > >> > >>
> > > >> > >> 1) W.r.t Sparse & Dense instances, I am trying to understand what you
> > > >> > >> meant by "prototypes".
> > > >> > >> Did you mean creating custom Avro data types like
> > > >> > >> 'SparseNumeric', 'SparseNominal', 'DenseInstance' etc.? If yes, the actual
> > > >> > >> data stored in the file (JSON encoded) may become heavy. For example, for the
> > > >> > >> iris data-set, if we decide to use a 'SparseNumeric' type for
> > > >> > >> 'sepallength',
> > > >> > >>
> > > >> > >> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> > > >> > >>
> > > >> > >> the data may look like this,
> > > >> > >>
> > > >> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> > > >> > >>
> > > >> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> > > >> > >>
> > > >> > >> It may become painful for a user with existing Avro data to convert it
> > > >> > >> into a 'SAMOA compatible Avro'. Wouldn't it be easier if we
> > > >> > >> just distinguish it inside the code, say if at least one attribute in the
> > > >> > >> metadata uses the generic Avro optionality (e.g. ["null", "double"]), then
> > > >> > >> we do readInstanceSparse() in the Loader and map correspondingly? Or is
> > > >> > >> there some other complexity that I have not looked at?
> > > >> > >>
> > > >> > >> 2) Yes. Skipping the Date-type attributes will make it easier!
> > > >> > >>
> > > >> > >> Regarding the engineering aspects,
> > > >> > >>
> > > >> > >> We can have the Avro dependency in the deployable jar of SAMOA.
> > > >> > >> In the code, maybe
> > > >> > >>
> > > >> > >> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
> > > >> > >> 2) Maybe a different Reader altogether for handling binary stream
> > > >> > >> 3) A user option to switch between JSON/Binary encoding
> > > >> > >>
> > > >> > >> If there is a better way to do it, kindly advise.
> > > >> > >>
> > > >> > >> Thanks
> > > >> > >> Jay
> > > >> > >> https://github.com/jayadeepj
> > > >> > >>
> > > >> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> > > >> > >> [email protected]> wrote:
> > > >> > >>
> > > >> > >>> Hi Jayadeep,
> > > >> > >>>
> > > >> > >>> I think it's pretty cool!
> > > >> > >>> If we get both Avro and Kafka support right, we can connect to almost
> > > >> > >>> anything.
> > > >> > >>>
> > > >> > >>> The document looks very comprehensive, you seem to have given a lot of
> > > >> > >>> thought to it.
> > > >> > >>> I am not extremely familiar with Avro myself, I've just used it a couple
> > > >> > >>> of times, but I'll try to provide some suggestions.
> > > >> > >>>
> > > >> > >>> - The general idea of where and how to store data and meta-data seems
> > > >> > >>> right.
> > > >> > >>> - In general, all attributes in a sparse instance are optional, and all
> > > >> > >>> attributes in a dense instance are required. Maybe we want to be more
> > > >> > >>> granular than this in the future, but it seems that Avro supports a
> > > >> > >>> superset of these settings. We may want to have some default "prototypes"
> > > >> > >>> in order to make mapping the current dense/sparse instances easy.
> > > >> > >>> - Right now we are not making use of Date-type attributes in SAMOA
> > > >> > >>> (there is no such thing in samoa-instances), so if it makes it easier we
> > > >> > >>> could skip supporting it. Ideally we could have algorithms that respect
> > > >> > >>> event-time as provided by timestamps in the instances (as opposed to
> > > >> > >>> processing the event whenever it arrives), however we are not there yet :)
> > > >> > >>>
> > > >> > >>> All the rest seems pretty straightforward.
> > > >> > >>>
> > > >> > >>> Moving to the more software-engineering oriented aspects, where would we
> > > >> > >>> have dependencies for Avro? And how should they be deployed? Would they
> > > >> > >>> simply go inside the deployable uber-jar of SAMOA?
> > > >> > >>>
> > > >> > >>> Thanks,
> > > >> > >>>
> > > >> > >>> --
> > > >> > >>> Gianmarco
> > > >> > >>>
> > > >> > >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
> > > >> > >>>
> > > >> > >>>> Hi Gianmarco / All,
> > > >> > >>>>
> > > >> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically I
> > > >> > >>>> want to use data stored in Avro Files to be used as input to SAMOA.
> > > >> > >>>>
> > > >> > >>>> As I understand, current SAMOA readers only support ARFF format. Do you
> > > >> > >>>> think such a feature would be useful to SAMOA in general ? Avro allows two
> > > >> > >>>> encodings for the data: Binary & JSON. Hence an Avro support may allow
> > > >> > >>>> users with JSON data also to use SAMOA.
> > > >> > >>>>
> > > >> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
> > > >> > >>>> Format document in Google Docs.
> > > >> > >>>>
> > > >> > >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> > > >> > >>>>
> > > >> > >>>> Would it be possible for you to have a look and provide your valuable
> > > >> > >>>> suggestions ? Thanks
> > > >> > >>>>
> > > >> > >>>> Thanks
> > > >> > >>>> Jay
> > > >> > >>>> https://github.com/jayadeepj
> > > >> > >>
> > > >> > >> --
> > > >> > >> Thanks
> > > >> > >> Jay
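
For readers following the thread, the sparse/dense rule from the 4 November message (any attribute declared as a union with "null" makes the loader produce sparse instances, otherwise dense ones) can be sketched roughly as below. This is an illustrative Python sketch that only uses the standard json module, not the Java code from the pull request. The function names choose_instance_kind and unwrap_union_value and both iris schemas are assumptions made for the example. The unwrapping helper relies on the Avro JSON encoding of unions, where a non-null value is written as a single-key object such as {"double": 4.7}.

```python
import json


def is_nullable(avro_type):
    """True if a field type is an Avro union (a JSON list) that contains "null"."""
    return isinstance(avro_type, list) and "null" in avro_type


def choose_instance_kind(schema_json):
    """Hypothetical helper: apply the rule discussed in the thread.

    At least one nullable attribute -> "sparse", otherwise "dense".
    """
    schema = json.loads(schema_json)
    fields = schema.get("fields", [])
    return "sparse" if any(is_nullable(f["type"]) for f in fields) else "dense"


def unwrap_union_value(value):
    """Hypothetical helper for JSON-encoded records.

    Avro's JSON encoding writes a non-null union value as {"<type>": value};
    null stays as plain null. This sketch assumes primitive attributes only,
    so any single-key object is treated as a wrapped union value.
    """
    if isinstance(value, dict) and len(value) == 1:
        return next(iter(value.values()))
    return value


def make_iris_schema(numeric_type):
    """Build an illustrative iris schema (not the one used in the PR)."""
    names = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
    fields = [{"name": n, "type": numeric_type} for n in names]
    fields.append({"name": "class", "type": "string"})
    return json.dumps({"type": "record", "name": "Iris", "fields": fields})


DENSE_IRIS = make_iris_schema("double")            # every attribute required
SPARSE_IRIS = make_iris_schema(["null", "double"])  # numeric attributes optional

if __name__ == "__main__":
    print(choose_instance_kind(DENSE_IRIS))
    print(choose_instance_kind(SPARSE_IRIS))
    record = json.loads('{"sepallength": {"double": 4.7}, "sepalwidth": null}')
    print({k: unwrap_union_value(v) for k, v in record.items()})
```

The check is made once on the schema rather than per record, which matches the "all optional or all required" expectation agreed on in the 26 October reply.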
