[ 
https://issues.apache.org/jira/browse/SAMOA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224330#comment-15224330
 ] 

ASF GitHub Bot commented on SAMOA-58:
-------------------------------------

Github user edi-bice commented on a diff in the pull request:

    https://github.com/apache/incubator-samoa/pull/48#discussion_r58394751
  
    --- Diff: samoa-api/src/main/java/org/apache/samoa/streams/FileStream.java ---
    @@ -52,9 +49,18 @@
           's', "Source Type (HDFS, local FS)", FileStreamSource.class,
           "LocalFileStreamSource");
     
    +  public IntOption classIndexOption = new IntOption("classIndex", 'c',
    +          "Class index of data. 0 for none or -1 for last attribute in file.", -1, -1, Integer.MAX_VALUE);
    +
    +  private FloatOption floatOption = new FloatOption("classWeight", 'w', "", 1.0);
    --- End diff --
    
    Yes, it would. The class weight option is indeed implemented via instance 
weights. 
    
    Most machine learning algorithms minimize total error, so in extremely
    imbalanced scenarios (fraud, terrorism, disease) they tend to miss the
    sparse class, which is the one we are really after. Class weighting lets
    one incorporate a priori knowledge of the imbalance. For example, the SVM
    implementations in sklearn and in R's e1071 package have class weight
    options.
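
    To make the point concrete, here is a minimal, self-contained sketch of
    class weighting expressed as instance weights. It does not use the SAMOA
    API; the class and method names (ClassWeightSketch, instanceWeight,
    weightedError) and the choice of which class receives the weight are
    assumptions made for illustration only, not the code in this pull request.

```java
public class ClassWeightSketch {

    // Hypothetical helper: instances of the weighted (rare) class get
    // classWeight, all other instances keep the default weight of 1.0.
    public static double instanceWeight(int label, int weightedClass, double classWeight) {
        return label == weightedClass ? classWeight : 1.0;
    }

    // Weighted error rate: total weight of the misclassified instances
    // divided by the total weight of all instances.
    public static double weightedError(int[] labels, int[] predictions,
                                       int weightedClass, double classWeight) {
        double wrong = 0.0;
        double total = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double w = instanceWeight(labels[i], weightedClass, classWeight);
            total += w;
            if (labels[i] != predictions[i]) {
                wrong += w;
            }
        }
        return total == 0.0 ? 0.0 : wrong / total;
    }

    public static void main(String[] args) {
        // Illustrative data: 98 instances of class 0 and 2 of the rare class 1.
        int[] labels = new int[100];
        labels[98] = 1;
        labels[99] = 1;

        // A degenerate classifier that always predicts the majority class 0.
        int[] alwaysMajority = new int[100];

        // With weight 1.0 the two misses are 2 out of 100 units of weight.
        System.out.println(weightedError(labels, alwaysMajority, 1, 1.0));
        // With weight 49.0 the two misses carry 98 out of 196 units of weight.
        System.out.println(weightedError(labels, alwaysMajority, 1, 49.0));
    }
}
```

    With unit weights the always-majority classifier looks nearly perfect
    (2% error) even though it never detects the rare class. Weighting the
    rare class by the inverse of the imbalance raises the same classifier's
    error to 50%, so a learner that minimizes weighted error can no longer
    ignore that class.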


> Samoa AvroFileStream from HDFSFileStreamSource stops at end of first file
> -------------------------------------------------------------------------
>
>                 Key: SAMOA-58
>                 URL: https://issues.apache.org/jira/browse/SAMOA-58
>             Project: SAMOA
>          Issue Type: Bug
>          Components: SAMOA-Instances
>         Environment: RHEL 6.6, java 1.8.0_72
>            Reporter: Edi Bice
>            Assignee: Gianmarco De Francisci Morales
>
> It appears Samoa is capable of streaming a collection of files as a single 
> stream, effectively concatenating the files. However, when using Samoa 
> AvroFileStream with HDFSFileStreamSource, the stream seems to stop at the 
> end of the first file:
> bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar 
> "PrequentialEvaluation -i -1 -l (classifiers.ensemble.Bagging -s 100) -s 
> (AvroFileStream -s HDFSFileStreamSource -f 
> /tmp/order_and_feats_flat_avro/2016_02_18/ -c 1 -e binary) -f 10000"
> 2016-02-18 20:43:20,991 [main] INFO  
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:183) 
> - last event is received!
> 2016-02-18 20:43:20,991 [main] INFO  
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:184) 
> - total count: 262144
> ...
> 2016-02-18 20:43:20,993 [main] INFO  
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:191) 
> - total evaluation time: 34 seconds for 262144 instances
> bash-4.1$ hadoop fs -ls /tmp/order_and_feats_flat_avro/2016_02_18 | more
> Found 70 items
> -rw-r--r--   3 yarn hdfs  230855335 2016-02-18 16:01 
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00001
> -rw-r--r--   3 yarn hdfs  229800273 2016-02-18 16:04 
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00002
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
