Hi Gianmarco, I have implemented this functionality as per the suggestions and have raised a pull request.
The implementation details are as follows:

1) A new AvroFileStream, a subclass of the existing FileStream, that takes the encoding format (json/binary) from the command line. It uses an InputStream instead of the current io Reader so that it can handle binary streams.
2) A common Loader interface that makes the parsing of streams generic rather than ARFF-only.
3) A new AvroLoader abstract class in samoa-instances that handles the parsing of Avro Generic Records from the InputStream into SAMOA instances. If even one attribute in the Avro schema is a union with null (a nullable attribute), the records are converted into SAMOA Sparse Instances, otherwise into DenseInstances.
4) Two subclasses of AvroLoader for binary and JSON parsing, i.e. AvroBinaryLoader and AvroJsonLoader. Both set the meta-data and the Avro schema on initialization, and each uses its own decoder to read from the stream.
5) Appropriate changes in the poms, Instances.java and ARFFLoader to use the new Loader interface.

However, I have seen that the Travis build has failed. I couldn't tell from the logs whether it is due to this code change.

Thanks
Jay
https://github.com/jayadeepj

On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <[email protected]> wrote:

> Hi Jay,
>
> 1) I agree custom data types would be overkill.
> I was thinking of the second option you mentioned, distinguishing it
> inside the code.
> So the parser code would expect either all values to be optional, or all
> values to be required.
>
> I think the plan you have in mind is quite reasonable.
> I don't have other suggestions right now.
>
> Thanks,
>
> --
> Gianmarco
>
> On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
>
>> Hi Gianmarco,
>>
>> Thanks for your reply. Regarding the points you mentioned,
>>
>> 1) W.r.t. sparse & dense instances, I am trying to understand what you
>> meant by "prototypes". Did you mean creating custom Avro data types like
>> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.?
>> If yes, the actual
>> data stored in the file (JSON encoded) may become heavy. For example, for the
>> iris data-set, if we decide to use a 'SparseNumeric' type for
>> 'sepallength',
>>
>> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
>>
>> the data may look like this,
>>
>> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
>>
>> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
>>
>> It may become painful for a user with existing Avro data to convert it
>> into 'SAMOA compatible Avro'. Wouldn't it be easier if we
>> just distinguish it inside the code, say if at least one attribute in the
>> metadata uses the generic Avro optionality (e.g. ["null", "double"]), then
>> we do readInstanceSparse() in the Loader and map correspondingly? Or is
>> there some other complexity that I have not looked at?
>>
>> 2) Yes. Skipping the Date-type attributes will make it easier!
>>
>> Regarding the engineering aspects,
>>
>> We can have the Avro dependency in the deployable jar of SAMOA. In the
>> code, maybe
>>
>> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
>> 2) Maybe a different Reader altogether for handling the binary stream
>> 3) A user option to switch between JSON/Binary encoding
>>
>> If there is a better way to do it, kindly advise.
>>
>> Thanks
>> Jay
>> https://github.com/jayadeepj
>>
>> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <[email protected]> wrote:
>>
>>> Hi Jayadeep,
>>>
>>> I think it's pretty cool!
>>> If we get both Avro and Kafka support right, we can connect to almost
>>> anything.
>>>
>>> The document looks very comprehensive, you seem to have given a lot of
>>> thought to it.
>>> I am not extremely familiar with Avro myself, I've just used it a couple
>>> of times, but I'll try to provide some suggestions.
>>>
>>> - The general idea of where and how to store data and meta-data seems
>>> right.
>>> - In general, all attributes in a sparse instance are optional, and all
>>> attributes in a dense instance are required. Maybe we want to be more
>>> granular than this in the future, but it seems that Avro supports a
>>> superset of these settings. We may want to have some default "prototypes"
>>> in order to make mapping the current dense/sparse instances easy.
>>> - Right now we are not making use of Date-type attributes in SAMOA
>>> (there is no such thing in samoa-instances), so if it makes it easier we
>>> could skip supporting it. Ideally we could have algorithms that respect
>>> event-time as provided by timestamps in the instances (as opposed to
>>> processing the event whenever it arrives), however we are not there yet :)
>>>
>>> All the rest seems pretty straightforward.
>>>
>>> Moving to the more software-engineering oriented aspects, where would we
>>> have dependencies for Avro? And how should they be deployed? Would they
>>> simply go inside the deployable uber-jar of SAMOA?
>>>
>>> Thanks,
>>>
>>> --
>>> Gianmarco
>>>
>>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
>>>
>>>> Hi Gianmarco / All,
>>>>
>>>> I am working on an integration of SAMOA with Apache Avro. Basically I
>>>> want to use data stored in Avro files as input to SAMOA.
>>>>
>>>> As I understand, current SAMOA readers only support the ARFF format. Do you
>>>> think such a feature would be useful to SAMOA in general? Avro allows two
>>>> encodings for the data: Binary & JSON. Hence Avro support may also allow
>>>> users with JSON data to use SAMOA.
>>>>
>>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>>>> Format document in Google Docs.
>>>>
>>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>>>>
>>>> Would it be possible for you to have a look and provide your valuable
>>>> suggestions? Thanks
>>>>
>>>> Thanks
>>>> Jay
>>>> https://github.com/jayadeepj
>>>>
>>>
>>
>> --
>> Thanks
>> Jay
>>
>> Jayadeep J
>> Mob: (+91) - 9176669142
>>
>

--
Thanks
Jay

Jayadeep J
Mob: (+91) - 9176669142
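
For reference, the sparse/dense rule discussed in this thread (if at least one attribute in the Avro schema is a union with null, read sparse instances, otherwise dense) can be sketched as below. This is only a minimal illustration in Python using the standard json module, not the Java code in the pull request. The function names and the cut-down iris schema (including the "string" type for the class attribute) are assumptions made up for the example, and it relies only on the fact that an Avro schema is a JSON document in which a union is written as a JSON array.

```python
import json


def is_null_branch(branch):
    """True if a union branch is Avro's null type ("null" or {"type": "null"})."""
    return branch == "null" or (
        isinstance(branch, dict) and branch.get("type") == "null"
    )


def has_nullable_attribute(schema_json):
    """True if at least one field of the record schema is a union containing null."""
    schema = json.loads(schema_json)
    for field in schema.get("fields", []):
        field_type = field["type"]
        # In Avro's JSON schema syntax a union is written as a JSON array of branches.
        if isinstance(field_type, list) and any(is_null_branch(b) for b in field_type):
            return True
    return False


def instance_kind(schema_json):
    """Rule from the thread: any nullable attribute means sparse, otherwise dense."""
    return "sparse" if has_nullable_attribute(schema_json) else "dense"


def iris_schema(sepallength_type):
    """Hypothetical, cut-down iris record schema; only the sepallength type varies."""
    return json.dumps({
        "type": "record",
        "name": "Iris",
        "fields": [
            {"name": "sepallength", "type": sepallength_type},
            {"name": "sepalwidth", "type": "double"},
            {"name": "class", "type": "string"},
        ],
    })


dense_schema = iris_schema("double")
sparse_schema = iris_schema(["null", "double"])

print(instance_kind(dense_schema))   # dense
print(instance_kind(sparse_schema))  # sparse
```

The check looks only at the schema, never at the data, so the choice between the sparse and dense read path would be made once per stream rather than once per record.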
