Hi Gianmarco,

I have implemented this functionality as per the suggestions and have
raised a pull request.

The implementation details are as follows.

1) A new AvroFileStream, a subclass of the existing FileStream, that takes
the encoding format (json/binary) from the command line. It uses an
InputStream instead of the current io Reader so that it can handle binary
streams.
2) A common Loader interface to make the parsing of streams generic rather
than ARFF-only.
3) A new AvroLoader abstract class in samoa-instances that handles the
parsing of Avro Generic Records from the InputStream into SAMOA instances.
If even one attribute in the Avro schema has a null union (a nullable
attribute), the record is converted into a SAMOA SparseInstance; otherwise
it is converted into a DenseInstance.
4) Two subclasses of AvroLoader for binary and JSON parsing, i.e.
AvroJsonLoader and AvroBinaryLoader. Both set the meta-data and the Avro
schema on initialization, and each uses its own decoder to read from the
stream.
5) Appropriate changes in the poms, Instances.java and ARFFLoader to use the
new Loader interface.
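
To illustrate the rule in point 3, here is a minimal sketch of the
sparse/dense decision. It is not the code in the pull request: the class
InstanceKindSketch, the enum InstanceKind and the methods isNullable() and
kindFor() are hypothetical names, and the schema is modelled as a plain map
from field name to the list of type names in its union, so that the example
runs without the Avro library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the sparse/dense decision; not the actual AvroLoader. */
final class InstanceKindSketch {

    enum InstanceKind { SPARSE, DENSE }

    /**
     * Assumption: a field is nullable when its type is a union that
     * contains "null", e.g. ["null", "double"].
     */
    static boolean isNullable(List<String> typeUnion) {
        return typeUnion.size() > 1 && typeUnion.contains("null");
    }

    /** One nullable field is enough to treat every record as sparse. */
    static InstanceKind kindFor(Map<String, List<String>> fields) {
        for (List<String> typeUnion : fields.values()) {
            if (isNullable(typeUnion)) {
                return InstanceKind.SPARSE;
            }
        }
        return InstanceKind.DENSE;
    }

    public static void main(String[] args) {
        // Simplified stand-in for part of the iris schema.
        Map<String, List<String>> iris = new LinkedHashMap<>();
        iris.put("sepallength", Arrays.asList("null", "double"));
        iris.put("sepalwidth", Arrays.asList("double"));
        System.out.println(kindFor(iris)); // prints SPARSE
    }
}
```

With a real Avro schema the same check would be made against the union
types of each field instead of lists of strings.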

However, I see that the Travis build has failed. I couldn't tell from the
logs whether this is due to my code change.

Thanks
Jay
https://github.com/jayadeepj

On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
[email protected]> wrote:

> Hi Jay,
>
> 1) I agree custom data types would be overkill.
> I was thinking of the second option you mentioned, distinguishing it
> inside the code.
> So the parser code would expect either all values to be optional, or all
> values to be required.
>
> I think the plan you have in mind is quite reasonable.
> I don't have other suggestions right now.
>
> Thanks,
>
> --
> Gianmarco
>
> On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
>
>> Hi Gianmarco,
>>
>> Thanks for your reply. Regarding the points you mentioned,
>>
>> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
>> meant by "prototypes". Did you mean creating custom Avro data types like
>> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the actual
>> data stored in the file (JSON encoded) may become heavy. For example, for
>> the iris data-set, if we decide to use a 'SparseNumeric' type for
>> 'sepallength',
>>
>> {"name":"sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
>>
>> the data may look like this,
>>
>> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
>>
>> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
>>
>> Converting existing Avro data into a 'SAMOA compatible Avro' format may
>> become painful for users. Wouldn't it be easier if we just distinguish
>> it inside the code: say, if at least one attribute in the metadata uses
>> the generic Avro optionality (e.g. ["null", "double"]), then we do
>> readInstanceSparse() in the Loader and map correspondingly? Or is
>> there some other complexity that I have not looked at?
>>
>> 2) Yes. Skipping the Date-type attributes will make it easier!
>>
>> Regarding the engineering aspects,
>>
>> We can have the Avro dependency in the deployable jar of SAMOA. In the
>> code, maybe
>>
>> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
>> 2) Maybe a different Reader altogether for handling the binary stream
>> 3) A user option to switch between JSON/Binary encoding
>>
>> If there is a better way to do it, please advise.
>>
>> Thanks
>> Jay
>> https://github.com/jayadeepj
>>
>> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
>> [email protected]> wrote:
>>
>>> Hi Jayadeep,
>>>
>>> I think it's pretty cool!
>>> If we get both Avro and Kafka support right, we can connect to almost
>>> anything.
>>>
>>> The document looks very comprehensive; you seem to have given a lot of
>>> thought to it.
>>> I am not extremely familiar with Avro myself, I've just used it a couple
>>> of times, but I'll try to provide some suggestions.
>>>
>>> - The general idea of where and how to store data and meta-data seems
>>> right.
>>> - In general, all attributes in a sparse instance are optional, and all
>>> attributes in a dense instance are required. Maybe we want to be more
>>> granular than this in the future, but it seems that Avro supports a
>>> superset of these settings. We may want to have some default "prototypes"
>>> in order to make mapping the current dense/sparse instances easy.
>>> - Right now we are not making use of Date-type attributes in SAMOA
>>> (there is no such thing in samoa-instances), so if it makes it easier we
>>> could skip supporting it. Ideally we could have algorithms that respect
>>> event-time as provided by timestamps in the instances (as opposed to
>>> processing the event whenever it arrives), however we are not there yet :)
>>>
>>> All the rest seems pretty straightforward.
>>>
>>> Moving to the more software-engineering oriented aspects, where would we
>>> have dependencies for Avro? And how should they be deployed? Would they
>>> simply go inside the deployable uber-jar of SAMOA?
>>>
>>> Thanks,
>>>
>>> --
>>> Gianmarco
>>>
>>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
>>>
>>>> Hi Gianmarco / All,
>>>>
>>>> I am working on an integration of SAMOA with Apache Avro. Basically, I
>>>> want to use data stored in Avro files as input to SAMOA.
>>>>
>>>> As I understand it, the current SAMOA readers only support the ARFF
>>>> format. Do you think such a feature would be useful to SAMOA in
>>>> general? Avro allows two encodings for the data: binary and JSON. Hence
>>>> Avro support may also allow users with JSON data to use SAMOA.
>>>>
>>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>>>> Format document in Google Docs.
>>>>
>>>>
>>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>>>>
>>>>
>>>> Would it be possible for you to have a look and provide your valuable
>>>> suggestions?
>>>>
>>>>
>>>> Thanks
>>>> Jay
>>>> https://github.com/jayadeepj
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Jay
>>
>>
>> Jayadeep J
>> Mob: (+91) - 9176669142
>>
>
>


-- 
Thanks
Jay


Jayadeep J
Mob: (+91) - 9176669142
