[
https://issues.apache.org/jira/browse/SAMOA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224330#comment-15224330
]
ASF GitHub Bot commented on SAMOA-58:
-------------------------------------
GitHub user edi-bice commented on a diff in the pull request:
https://github.com/apache/incubator-samoa/pull/48#discussion_r58394751
--- Diff: samoa-api/src/main/java/org/apache/samoa/streams/FileStream.java ---
@@ -52,9 +49,18 @@
's', "Source Type (HDFS, local FS)", FileStreamSource.class,
"LocalFileStreamSource");
+ public IntOption classIndexOption = new IntOption("classIndex", 'c',
+ "Class index of data. 0 for none or -1 for last attribute in file.", -1, -1, Integer.MAX_VALUE);
+
+ private FloatOption floatOption = new FloatOption("classWeight", 'w', "", 1.0);
--- End diff --
Yes, it would. The class weight option is indeed implemented via instance
weights.
Most machine learning algorithms minimize total error, so in extremely
imbalanced scenarios (fraud, terrorism, disease) they tend to miss the
sparse class, which is really what we're after. Class weighting lets one
incorporate a priori knowledge of the imbalance. For example, scikit-learn
and the SVM in R's e1071 package both have class weight options.
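To make the idea concrete, here is a minimal, self-contained sketch of class weighting expressed as per-instance weights. It is not the code in this pull request and does not use the SAMOA Instance API; the class name ClassWeightSketch and the methods weightFor and weightedErrorRate are hypothetical names chosen for illustration only.

```java
public class ClassWeightSketch {

    /**
     * Hypothetical helper: instances of the rare class get classWeight,
     * all other instances keep the default weight of 1.0.
     */
    public static double weightFor(int label, int rareLabel, double classWeight) {
        return label == rareLabel ? classWeight : 1.0;
    }

    /**
     * Weighted error rate: each instance contributes its weight to the total,
     * and each misclassified instance contributes its weight to the error.
     */
    public static double weightedErrorRate(int[] labels, int[] predictions,
                                           int rareLabel, double classWeight) {
        double wrong = 0.0;
        double total = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double w = weightFor(labels[i], rareLabel, classWeight);
            total += w;
            if (labels[i] != predictions[i]) {
                wrong += w;
            }
        }
        return total == 0.0 ? 0.0 : wrong / total;
    }

    public static void main(String[] args) {
        // Assumed toy data: 99 instances of class 0 and one instance of the rare class 1.
        int[] labels = new int[100];
        labels[99] = 1;
        // A degenerate classifier that always predicts the majority class 0.
        int[] predictions = new int[100];

        System.out.println("unweighted error: " + weightedErrorRate(labels, predictions, 1, 1.0));
        System.out.println("weighted error:   " + weightedErrorRate(labels, predictions, 1, 99.0));
    }
}
```

With all weights equal to 1, the always-majority classifier misses the only rare instance yet shows just 1% error. With a weight of 99 on the rare class, that single miss accounts for half of the total weight, so a learner that minimizes weighted error can no longer ignore it.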
> Samoa AvroFileStream from HDFSFileStreamSource stops at end of first file
> -------------------------------------------------------------------------
>
> Key: SAMOA-58
> URL: https://issues.apache.org/jira/browse/SAMOA-58
> Project: SAMOA
> Issue Type: Bug
> Components: SAMOA-Instances
> Environment: RHEL 6.6, java 1.8.0_72
> Reporter: Edi Bice
> Assignee: Gianmarco De Francisci Morales
>
> It appears Samoa is capable of streaming a collection of files as a single
> stream, effectively concatenating the files. However, when using Samoa
> AvroFileStream with HDFSFileStreamSource, the stream seems to stop at the
> end of the first file:
> bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
> "PrequentialEvaluation -i -1 -l (classifiers.ensemble.Bagging -s 100) -s
> (AvroFileStream -s HDFSFileStreamSource -f
> /tmp/order_and_feats_flat_avro/2016_02_18/ -c 1 -e binary) -f 10000"
> 2016-02-18 20:43:20,991 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:183)
> - last event is received!
> 2016-02-18 20:43:20,991 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:184)
> - total count: 262144
> ...
> 2016-02-18 20:43:20,993 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:191)
> - total evaluation time: 34 seconds for 262144 instances
> bash-4.1$ hadoop fs -ls /tmp/order_and_feats_flat_avro/2016_02_18 | more
> Found 70 items
> -rw-r--r-- 3 yarn hdfs 230855335 2016-02-18 16:01
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00001
> -rw-r--r-- 3 yarn hdfs 229800273 2016-02-18 16:04
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00002
> ...