[ 
https://issues.apache.org/jira/browse/SAMOA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991397#comment-14991397
 ] 

ASF GitHub Bot commented on SAMOA-47:
-------------------------------------

Github user jayadeepj commented on the pull request:

    https://github.com/apache/incubator-samoa/pull/40#issuecomment-154003164
  
    
    
    ## Test Data (Forest Cover)
    
    The JSON-encoded Avro file for the Forest CoverType dataset is available at:
    
https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
    
    The binary-encoded Avro file for the Forest CoverType dataset is available at:
    
https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
    
    
    ## Test Instructions
    
    ### Local - Avro JSON
    
    bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
    
    ### Local - Avro Binary
    
    bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
    
    ### Storm - Avro JSON
    
    bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
    
    ### Storm - Avro Binary
    
    bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
    
    
    ## Input Format Documentation
    
    The updated input format document for Avro files in SAMOA is available at:
    
https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
    
    ## Implementation Details
    
    1. A new AvroFileStream, a subclass of the existing FileStream, that takes
       the encoding format (json/binary) from the command line. It uses an
       InputStream instead of the current java.io Reader so that it can handle
       binary streams.
    2. A common Loader interface that makes stream parsing generic rather than
       ARFF-only.
    3. A new AvroLoader abstract class in samoa-instances that parses Avro
       Generic Records from the InputStream into SAMOA instances. If even one
       attribute in the Avro schema is a union with null (a nullable
       attribute), the record is converted into a SAMOA SparseInstance;
       otherwise it is converted into a DenseInstance.
    4. Two subclasses of AvroLoader, AvroJsonLoader and AvroBinaryLoader, for
       JSON and binary parsing. Both set the metadata and the Avro schema on
       initialization and use separate decoders to read from the stream.
    5. Corresponding changes in the POMs, Instances.java and ARFFLoader to use
       the new Loader interface.
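
    The encoding switch from item 1 and the dense/sparse rule from item 3 can
    be sketched roughly as follows. This is only an illustration and not the
    code in the pull request: the class AvroLoaderSketch, its enums and its
    method names are all hypothetical, and each attribute type is modelled as
    a plain list of Avro type names (a union is a list with several branches)
    so that the sketch runs without the Avro library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names come from the SAMOA code base.
final class AvroLoaderSketch {

    /** Encodings accepted by the -e option of AvroFileStream. */
    enum Encoding { JSON, BINARY }

    /** The two instance flavours mentioned in item 3. */
    enum InstanceKind { DENSE, SPARSE }

    /** Maps the command-line value ("json" or "binary") to an Encoding. */
    static Encoding parseEncoding(String option) {
        switch (option.trim().toLowerCase(Locale.ROOT)) {
            case "json":
                return Encoding.JSON;
            case "binary":
                return Encoding.BINARY;
            default:
                throw new IllegalArgumentException("Unsupported Avro encoding: " + option);
        }
    }

    /** An attribute is nullable when "null" is one of its union branches. */
    static boolean isNullable(List<String> typeBranches) {
        return typeBranches.contains("null");
    }

    /** One nullable attribute is enough to choose the sparse representation. */
    static InstanceKind chooseInstanceKind(List<List<String>> attributeTypes) {
        for (List<String> branches : attributeTypes) {
            if (isNullable(branches)) {
                return InstanceKind.SPARSE;
            }
        }
        return InstanceKind.DENSE;
    }

    public static void main(String[] args) {
        // Schema without nullable attributes.
        List<List<String>> plain = Arrays.asList(
                Arrays.asList("double"), Arrays.asList("int"));
        // Schema where the second attribute is the union ["null", "int"].
        List<List<String>> withNullable = Arrays.asList(
                Arrays.asList("double"), Arrays.asList("null", "int"));

        System.out.println(parseEncoding("json") + " -> " + chooseInstanceKind(plain));
        System.out.println(parseEncoding("binary") + " -> " + chooseInstanceKind(withNullable));
    }
}
```

    In the actual loaders the nullable check would presumably be made against
    the parsed Avro schema rather than against lists of type names.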



> Integrate Avro Streams with SAMOA
> ---------------------------------
>
>                 Key: SAMOA-47
>                 URL: https://issues.apache.org/jira/browse/SAMOA-47
>             Project: SAMOA
>          Issue Type: New Feature
>          Components: SAMOA-API, SAMOA-Instances
>            Reporter: jayadeepj
>            Priority: Minor
>              Labels: patch
>
> The current SAMOA readers support data streams in ARFF format only. As a
> result, SAMOA's scope as a distributed streaming machine learning framework
> is limited, since end users may have to transform their data to ARFF. Apache
> Avro is a data serialization system that handles data streams in a compact
> binary format and is typically used in conjunction with Big Data ecosystem
> tools. Avro allows two encodings for the data: binary and JSON. Avro support
> may therefore also allow users with JSON data to use SAMOA seamlessly.
> The goal is to build support for Avro streams into SAMOA by adding an Avro
> file stream handler, an Avro loader that reads records and transforms them
> into instances, and a user option to switch between JSON and binary
> encodings. The input format, including the representation of metadata for
> both JSON and binary data, is to be finalized along with the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
