Thanks Jay,

I'll test it this weekend. Do you have some instructions and data I could
use to try it out?

--
Gianmarco

On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:

> Hi Gianmarco,
>
> I have implemented this functionality as per the suggestions and have
> raised a pull request.
>
> The implementation details are as below.
>
> 1) A new AvroFileStream, a subclass of the existing FileStream, that takes
> the encoding format (json/binary) from the command line. It uses
> InputStream instead of the current io Reader to handle binary streams.
> 2) A common Loader interface to make the parsing of streams generic rather
> than ARFF-only.
> 3) A new AvroLoader abstract class in samoa-instances that handles the
> parsing of the Avro Generic Records from the InputStream into SAMOA
> instances. If even one attribute in the Avro schema has a null union
> (nullable attribute), the record is converted into a SAMOA Sparse
> Instance, otherwise into a DenseInstance.
> 4) Two subclasses of AvroLoader for binary & JSON parsing, i.e.
> AvroJsonLoader & AvroBinaryLoader. Both set the meta-data & Avro schema on
> initialization. They use separate decoders to read from the stream.
> 5) Appropriate changes in the poms, Instances.java & ARFFLoader to use the
> new Loader interface.
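
To make sure I read the rule in point 3 correctly, here is a minimal sketch of it. It is not the code from the pull request and does not use the Avro library: the schema is modelled as a plain map from attribute name to the list of type names in its union, and NullableUnionCheck, InstanceKind, isNullable and kindFor are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the sparse/dense decision described in point 3.
// Assumption: each attribute maps to the type names of its Avro union;
// a non-union attribute is a single-element list.
final class NullableUnionCheck {

    enum InstanceKind { SPARSE, DENSE }

    // True when the type is a union that includes "null", e.g. ["null", "double"].
    static boolean isNullable(List<String> types) {
        return types.size() > 1 && types.contains("null");
    }

    // One nullable attribute is enough to treat the whole schema as sparse.
    static InstanceKind kindFor(Map<String, List<String>> attributes) {
        for (List<String> types : attributes.values()) {
            if (isNullable(types)) {
                return InstanceKind.SPARSE;
            }
        }
        return InstanceKind.DENSE;
    }

    public static void main(String[] args) {
        Map<String, List<String>> iris = new LinkedHashMap<>();
        iris.put("sepallength", Arrays.asList("null", "double"));
        iris.put("sepalwidth", Arrays.asList("double"));
        iris.put("class", Arrays.asList("string"));
        System.out.println(kindFor(iris)); // prints SPARSE
    }
}
```

With every attribute declared as a plain type (no "null" in any union), kindFor returns DENSE, which matches the "all optional or all required" split discussed earlier in the thread.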
>
> However, I have seen that the Travis build has failed. I couldn't tell
> from the logs whether it is due to this code change.
>
> Thanks
> Jay
> https://github.com/jayadeepj
>
> On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> [email protected]> wrote:
>
> > Hi Jay,
> >
> > 1) I agree custom data types would be overkill.
> > I was thinking of the second option you mentioned, distinguishing it
> > inside the code.
> > So the parser code would expect either all values to be optional, or all
> > values to be required.
> >
> > I think the plan you have in mind is quite reasonable.
> > I don't have other suggestions right now.
> >
> > Thanks,
> >
> > --
> > Gianmarco
> >
> > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
> >
> >> Hi Gianmarco,
> >>
> >> Thanks for your reply. Regarding the points you mentioned,
> >>
> >> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
> >> meant by "prototypes". Did you mean creating custom Avro data types like
> >> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the
> >> actual data stored in the file (JSON encoded) may become heavy. E.g., for
> >> the iris data-set, if we decide to use a 'SparseNumeric' type for
> >> 'sepallength',
> >>
> >> {"name":"sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> >>
> >> the data may look like this,
> >>
> >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> >>
> >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> >>
> >> Converting existing Avro data into a 'SAMOA compatible Avro' may become
> >> painful for a user. Wouldn't it be easier if we just distinguish it
> >> inside the code, say if at least one attribute in the metadata uses the
> >> generic Avro optionality (e.g. ["null", "double"]), then we do
> >> readInstanceSparse() in the Loader and map correspondingly? Or is there
> >> some other complexity that I have not looked at?
> >>
> >> 2) Yes. Skipping the Date-type attributes will make it easier!
> >>
> >> Regarding the engineering aspects,
> >>
> >> We can have the Avro dependency in the deployable jar of SAMOA. In the
> >> code, maybe
> >>
> >> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
> >> 2) Maybe a different Reader altogether for handling the binary stream
> >> 3) A user option to switch between JSON/Binary encoding
> >>
> >> If there is a better way to do it, kindly advise.
> >>
> >> Thanks
> >> Jay
> >> https://github.com/jayadeepj
> >>
> >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> >> [email protected]> wrote:
> >>
> >>> Hi Jayadeep,
> >>>
> >>> I think it's pretty cool!
> >>> If we get both Avro and Kafka support right, we can connect to almost
> >>> anything.
> >>>
> >>> The document looks very comprehensive, you seem to have given a lot of
> >>> thought to it.
> >>> I am not extremely familiar with Avro myself, I've just used it a
> >>> couple of times, but I'll try to provide some suggestions.
> >>>
> >>> - The general idea of where and how to store data and meta-data seems
> >>> right.
> >>> - In general, all attributes in a sparse instance are optional, and all
> >>> attributes in a dense instance are required. Maybe we want to be more
> >>> granular than this in the future, but it seems that Avro supports a
> >>> superset of these settings. We may want to have some default
> >>> "prototypes" in order to make mapping the current dense/sparse
> >>> instances easy.
> >>> - Right now we are not making use of Date-type attributes in SAMOA
> >>> (there is no such thing in samoa-instances), so if it makes it easier
> >>> we could skip supporting it. Ideally we could have algorithms that
> >>> respect event-time as provided by timestamps in the instances (as
> >>> opposed to processing the event whenever it arrives), however we are
> >>> not there yet :)
> >>>
> >>> All the rest seems pretty straightforward.
> >>>
> >>> Moving to the more software-engineering oriented aspects, where would
> >>> we have dependencies for Avro? And how should they be deployed? Would
> >>> they simply go inside the deployable uber-jar of SAMOA?
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Gianmarco
> >>>
> >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
> >>>
> >>>> Hi Gianmarco / All,
> >>>>
> >>>> I am working on an integration of SAMOA with Apache Avro. Basically I
> >>>> want to use data stored in Avro files as input to SAMOA.
> >>>>
> >>>> As I understand, current SAMOA readers only support the ARFF format. Do
> >>>> you think such a feature would be useful to SAMOA in general? Avro
> >>>> allows two encodings for the data: Binary & JSON. Hence Avro support
> >>>> may also allow users with JSON data to use SAMOA.
> >>>>
> >>>> Based on the input given by @gdfm to @ctippur, I have prepared an
> >>>> Input Format document in Google Docs.
> >>>>
> >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> >>>>
> >>>>
> >>>> Would it be possible for you to have a look and provide your valuable
> >>>> suggestions?
> >>>>
> >>>>
> >>>> Thanks
> >>>> Jay
> >>>> https://github.com/jayadeepj
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Thanks
> >> Jay
> >>
> >>
> >> Jayadeep J
> >> Mob: (+91) - 9176669142
> >>
> >
> >
>
>
> --
> Thanks
> Jay
>
>
> Jayadeep J
> Mob: (+91) - 9176669142
>
