Hi Gianmarco,

All the test instructions, test data and other details have been added to
the pull request.

Thanks
Jay
https://github.com/jayadeepj

On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
[email protected]> wrote:

> Thanks Jay,
>
> I'll test it this weekend. Do you have some instructions and data I could
> use to try it out?
>
> --
> Gianmarco
>
> On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:
>
> > Hi Gianmarco,
> >
> > I have implemented this functionality as per the suggestions and have
> > raised a pull request.
> >
> > The implementation details are as below.
> >
> > 1) A new AvroFileStream, a subclass of the existing FileStream, that
> > takes the encoding format (json/binary) from the command line. It uses
> > an InputStream instead of the current io Reader so that it can handle
> > binary streams.
> > 2) A common Loader interface that makes the parsing of streams generic
> > rather than ARFF-only.
> > 3) A new AvroLoader abstract class in samoa-instances that handles the
> > parsing of Avro Generic Records from the InputStream into SAMOA
> > instances. If even one attribute in the Avro schema has a null union
> > (nullable attribute), the record is converted into a SAMOA Sparse
> > Instance; otherwise it becomes a DenseInstance.
> > 4) Two subclasses of AvroLoader for binary and JSON parsing, i.e.
> > AvroJsonLoader and AvroBinaryLoader. Both set the meta-data and Avro
> > schema on initialization. They use separate decoders to read from the
> > stream.
> > 5) Appropriate changes in the poms, Instances.java and ARFFLoader to use
> > the new Loader interface.
> >
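The loader design in points 2 to 4 can be illustrated with a small sketch. It is written in Python with only the standard library, so it is not the SAMOA or Avro API. The class names `Loader` and `JsonRecordLoader`, the one-record-per-line layout, and the list-based instance representation are all assumptions made for this example.

```python
import io
import json
from abc import ABC, abstractmethod


class Loader(ABC):
    """Hypothetical stand-in for the common Loader interface."""

    @abstractmethod
    def read_instance(self):
        """Return the next instance, or None at the end of the stream."""


class JsonRecordLoader(Loader):
    """Reads JSON-encoded records from a binary stream.

    Assumption for this sketch: one record per line, no blank lines.
    """

    def __init__(self, stream, field_names, sparse):
        self.stream = stream            # binary stream, not a text Reader
        self.field_names = field_names  # attribute order taken from the schema
        self.sparse = sparse            # True if the schema has a nullable field

    def read_instance(self):
        line = self.stream.readline()
        if not line:
            return None
        record = json.loads(line.decode("utf-8"))
        values = [self._unwrap(record.get(name)) for name in self.field_names]
        if self.sparse:
            # Sparse: keep only (index, value) pairs for present attributes.
            return [(i, v) for i, v in enumerate(values) if v is not None]
        # Dense: one value per attribute, in schema order.
        return values

    @staticmethod
    def _unwrap(value):
        # Avro's JSON encoding writes a non-null union value as {"type": value}.
        if isinstance(value, dict) and len(value) == 1:
            return next(iter(value.values()))
        return value


if __name__ == "__main__":
    data = io.BytesIO(
        b'{"sepallength": null, "sepalwidth": 3.5}\n'
        b'{"sepallength": {"double": 4.7}, "sepalwidth": 1.4}\n'
    )
    loader = JsonRecordLoader(data, ["sepallength", "sepalwidth"], sparse=True)
    print(loader.read_instance())  # [(1, 3.5)]
    print(loader.read_instance())  # [(0, 4.7), (1, 1.4)]
```

A binary loader would share the same interface and differ only in how it decodes each record from the stream.
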
> > I did notice that the Travis build has failed, but I couldn't tell from
> > the logs whether it is due to this code change.
> >
> > Thanks
> > Jay
> > https://github.com/jayadeepj
> >
> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> > [email protected]> wrote:
> >
> > > Hi Jay,
> > >
> > > 1) I agree custom data types would be overkill.
> > > I was thinking of the second option you mentioned, distinguishing it
> > > inside the code.
> > > So the parser code would expect either all values to be optional, or
> > > all values to be required.
> > >
> > > I think the plan you have in mind is quite reasonable.
> > > I don't have other suggestions right now.
> > >
> > > Thanks,
> > >
> > > --
> > > Gianmarco
> > >
> > > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
> > >
> > >> Hi Gianmarco,
> > >>
> > >> Thanks for your reply. Regarding the points you mentioned,
> > >>
> > >> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
> > >> meant by "prototypes". Did you mean creating custom Avro data types
> > >> like 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes,
> > >> the actual data stored in the file (JSON encoded) may become heavy.
> > >> For example, for the iris data-set, if we decide to use a
> > >> 'SparseNumeric' type for 'sepallength',
> > >>
> > >> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> > >>
> > >> the data may look like this,
> > >>
> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> > >>
> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> > >>
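To make the size concern concrete, here is a rough comparison in Python. It is only an illustration: the helper name `encoded_size` is made up for this example, and the "plain" record assumes the attribute is declared as an ordinary ["null", "double"] union instead of a custom record type.

```python
import json

# Record using the custom 'SparseNumeric' wrapper type, as in the example data.
wrapped = {
    "sepallength": {
        "com.yahoo.labs.samoa.avro.iris.SparseNumeric": {"field": {"double": 4.7}}
    }
}

# Same value, assuming a plain ["null", "double"] union in the schema.
plain = {"sepallength": {"double": 4.7}}


def encoded_size(record):
    """Size in bytes of the compact JSON form of a record."""
    return len(json.dumps(record, separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    print(encoded_size(wrapped), encoded_size(plain))
```

The wrapper's fully qualified type name is repeated in every non-null value, so the overhead grows with the number of records.
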
> > >> Converting existing Avro data into a 'SAMOA compatible Avro' could
> > >> become painful for the user. Wouldn't it be easier if we just
> > >> distinguish it inside the code? Say, if at least one attribute in the
> > >> metadata uses the generic Avro optionality (e.g. ["null", "double"]),
> > >> then we do readInstanceSparse() in the Loader and map correspondingly.
> > >> Or is there some other complexity that I have not looked at?
> > >>
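The check described above can be sketched as follows. This is a Python illustration using only the standard `json` module, not the SAMOA or Avro API. The helper names `is_nullable` and `use_sparse_instances` and the trimmed iris schema are assumptions made for this example.

```python
import json

# Trimmed, hypothetical iris schema: only 'sepallength' is nullable.
IRIS_SCHEMA = """
{"type": "record", "name": "Iris", "fields": [
  {"name": "sepallength", "type": ["null", "double"]},
  {"name": "sepalwidth",  "type": "double"},
  {"name": "class",       "type": "string"}
]}
"""


def is_nullable(field_type):
    """An Avro union is written as a JSON array; it is nullable if it lists "null"."""
    return isinstance(field_type, list) and "null" in field_type


def use_sparse_instances(schema_json):
    """Return True if at least one field of the record schema is nullable."""
    schema = json.loads(schema_json)
    return any(is_nullable(field["type"]) for field in schema["fields"])


if __name__ == "__main__":
    print(use_sparse_instances(IRIS_SCHEMA))  # True
```

With this rule a user's existing Avro schema needs no SAMOA-specific types: the loader decides between the sparse and dense read path once, from the schema alone.
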
> > >> 2) Yes. Skipping the Date-type attributes will make it easier!
> > >>
> > >> Regarding the engineering aspects,
> > >>
> > >> We can have the Avro dependency in the deployable jar of SAMOA. In the
> > >> code, maybe
> > >>
> > >> 1) We could have an Avro equivalent of ArffFileStream.java &
> > >> ArffLoader
> > >> 2) Maybe a different Reader altogether for handling the binary stream
> > >> 3) A user option to switch between JSON/Binary encoding
> > >>
> > >> If there is a better way to do it, kindly advise.
> > >>
> > >> Thanks
> > >> Jay
> > >> https://github.com/jayadeepj
> > >>
> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> > >> [email protected]> wrote:
> > >>
> > >>> Hi Jayadeep,
> > >>>
> > >>> I think it's pretty cool!
> > >>> If we get both Avro and Kafka support right, we can connect to almost
> > >>> anything.
> > >>>
> > >>> The document looks very comprehensive, you seem to have given a lot
> > >>> of thought to it.
> > >>> I am not extremely familiar with Avro myself, I've just used it a
> > >>> couple of times, but I'll try to provide some suggestions.
> > >>>
> > >>> - The general idea of where and how to store data and meta-data seems
> > >>> right.
> > >>> - In general, all attributes in a sparse instance are optional, and
> > >>> all attributes in a dense instance are required. Maybe we want to be
> > >>> more granular than this in the future, but it seems that Avro
> > >>> supports a superset of these settings. We may want to have some
> > >>> default "prototypes" in order to make mapping the current
> > >>> dense/sparse instances easy.
> > >>> - Right now we are not making use of Date-type attributes in SAMOA
> > >>> (there is no such thing in samoa-instances), so if it makes it easier
> > >>> we could skip supporting it. Ideally we could have algorithms that
> > >>> respect event-time as provided by timestamps in the instances (as
> > >>> opposed to processing the event whenever it arrives), however we are
> > >>> not there yet :)
> > >>>
> > >>> All the rest seems pretty straightforward.
> > >>>
> > >>> Moving to the more software-engineering oriented aspects, where would
> > >>> we have dependencies for Avro? And how should they be deployed? Would
> > >>> they simply go inside the deployable uber-jar of SAMOA?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> --
> > >>> Gianmarco
> > >>>
> > >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
> > >>>
> > >>>> Hi Gianmarco / All,
> > >>>>
> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically,
> > >>>> I want to use data stored in Avro files as input to SAMOA.
> > >>>>
> > >>>> As I understand, current SAMOA readers only support the ARFF format.
> > >>>> Do you think such a feature would be useful to SAMOA in general? Avro
> > >>>> allows two encodings for the data: Binary & JSON. Hence Avro support
> > >>>> may also allow users with JSON data to use SAMOA.
> > >>>>
> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an
> > >>>> Input Format document in Google Docs.
> > >>>>
> > >>>>
> > >>>>
> > >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> > >>>>
> > >>>>
> > >>>> Would it be possible for you to have a look and provide your
> > >>>> valuable suggestions? Thanks
> > >>>>
> > >>>>
> > >>>> Thanks
> > >>>> Jay
> > >>>> https://github.com/jayadeepj
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Thanks
> > >> Jay
> > >>
> > >>
> > >> Jayadeep J
> > >> Mob: (+91) - 9176669142
> > >>
> > >
> > >
> >
> >
> > --
> > Thanks
> > Jay
> >
> >
> > Jayadeep J
> > Mob: (+91) - 9176669142
> >
>
