[
https://issues.apache.org/jira/browse/SAMOA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157073#comment-15157073
]
ASF GitHub Bot commented on SAMOA-58:
-------------------------------------
GitHub user edi-bice opened a pull request:
https://github.com/apache/incubator-samoa/pull/48
Patch for SAMOA-58 (Samoa AvroFileStream from HDFSFileStreamSource stops at
end of first file)
FileStreamSource appeared to support multiple files, but in my testing a Samoa
AvroFileStream reading from HDFSFileStreamSource stopped at the end of the
first file. I had to change AvroFileStream, ArffFileStream and their parent
FileStream to make multi-file streaming work.
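To illustrate the general idea (this is a minimal sketch, not the code in this pull request): when one file is exhausted, both the underlying stream and the reader built on top of it have to be replaced before reading continues. The class name MultiFileLineStream and its method are hypothetical, and each Supplier<Reader> is an assumed stand-in for opening one file from the source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical illustration only; not the SAMOA FileStream API. */
class MultiFileLineStream {
    // Each supplier stands in for "open the next file" (an assumption).
    private final Iterator<Supplier<Reader>> sources;
    // Per-file loader; it must be rebuilt whenever a new file is opened.
    private BufferedReader loader;

    MultiFileLineStream(List<Supplier<Reader>> sources) {
        this.sources = sources.iterator();
    }

    /** Returns the next record, or null only once every file is exhausted. */
    String next() {
        try {
            while (true) {
                if (loader == null) {
                    if (!sources.hasNext()) {
                        return null; // no files left: the stream is finished
                    }
                    // Open the next file and build a fresh loader on it.
                    loader = new BufferedReader(sources.next().get());
                }
                String record = loader.readLine();
                if (record != null) {
                    return record;
                }
                // Current file is exhausted: discard its loader and move on,
                // instead of reporting end-of-stream after the first file.
                loader.close();
                loader = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The loop also skips empty files, so a null return always means the whole collection has been read.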
See following JIRA for additional detail:
https://issues.apache.org/jira/browse/SAMOA-58
Additionally, I modified bin/samoa, pom.xml and SystemUtils, and added a
resource, to fix reading from HDFS on my cluster.
A seemingly unrelated change is an explicit test for supported Avro types.
Fields of unsupported types are now filtered out, instead of assuming that all
non-nominal (non-enum) fields are numeric and then failing during reading.
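A rough sketch of that kind of filter, under stated assumptions: the supported set (double, float, long, int and enum) is taken from the first commit message below, while SupportedFieldFilter, its FieldType enum and supportedFields are hypothetical names used to keep the example free of the Avro dependency. The enum constants mirror names from Avro's Schema.Type.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the AvroFileStream code in this patch. */
class SupportedFieldFilter {
    /** Stand-in for a schema field type (names mirror Avro's Schema.Type). */
    enum FieldType { DOUBLE, FLOAT, LONG, INT, ENUM, STRING, BOOLEAN, BYTES }

    // Numeric types plus enum (nominal); everything else is skipped.
    private static final Set<FieldType> SUPPORTED = EnumSet.of(
            FieldType.DOUBLE, FieldType.FLOAT, FieldType.LONG,
            FieldType.INT, FieldType.ENUM);

    /** Keeps only the names of fields whose type is in the supported set. */
    static List<String> supportedFields(Map<String, FieldType> schema) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, FieldType> field : schema.entrySet()) {
            if (SUPPORTED.contains(field.getValue())) {
                kept.add(field.getKey());
            }
            // Unsupported fields are dropped here, up front, rather than
            // being treated as numeric and failing later at parse time.
        }
        return kept;
    }
}
```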
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/edi-bice/incubator-samoa master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-samoa/pull/48.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #48
----
commit 5cbbcfab94db47732ab44b3b9d752c45f02e2f30
Author: edi_bice <[email protected]>
Date: 2016-02-17T15:45:07Z
Only add fields of supported types (double, float, long, int and enum)
rather than adding and defaulting all non-enum to numeric and failing at value
parse time
commit d5a055f5c5ff0c6787beaa03234375cdcbb89cb5
Author: edi_bice <[email protected]>
Date: 2016-02-17T21:53:02Z
until we change samza to produce files with .avro extension
commit ba73bb24d9477207e8dfd85fbf478be1e3877c7d
Author: edi_bice <[email protected]>
Date: 2016-02-18T22:06:12Z
A tentative solution to issue described in:
https://issues.apache.org/jira/browse/SAMOA-58
commit 29e0379949eb7847ea46bfe432d98d90dff993e9
Author: edi_bice <[email protected]>
Date: 2016-02-19T16:55:03Z
The issue described in https://issues.apache.org/jira/browse/SAMOA-58 was
apparently more complicated than the previous commit assumed. While we did
succeed in replacing the first exhausted file stream with a new one, the
loader was not replaced and would return null. This rework of AvroFileStream,
FileStream and ArffFileStream hopefully cleans things up a bit and allows
multi-file streams of either type (Avro or Arff).
commit fe093240a248e26be84ded4d378acc1d5c81d599
Author: edi_bice <[email protected]>
Date: 2016-01-25T17:02:22Z
configure don't code
commit 99f04bb4396190e92af2a43e56d005cb502357ca
Author: Edi Bice <[email protected]>
Date: 2016-02-22T14:25:43Z
cherry-picked from faf branch - changes needed to be able to read from HDFS
on a YARN 2.7.1 cluster
----
> Samoa AvroFileStream from HDFSFileStreamSource stops at end of first file
> -------------------------------------------------------------------------
>
> Key: SAMOA-58
> URL: https://issues.apache.org/jira/browse/SAMOA-58
> Project: SAMOA
> Issue Type: Bug
> Components: SAMOA-Instances
> Environment: RHEL 6.6, java 1.8.0_72
> Reporter: Edi Bice
>
> It appears Samoa is capable of streaming a collection of files as a single
> stream, effectively concatenating the files. However, when using Samoa
> AvroFileStream with HDFSFileStreamSource, the stream seems to stop at the
> end of the first file:
> bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
> "PrequentialEvaluation -i -1 -l (classifiers.ensemble.Bagging -s 100) -s
> (AvroFileStream -s HDFSFileStreamSource -f
> /tmp/order_and_feats_flat_avro/2016_02_18/ -c 1 -e binary) -f 10000"
> 2016-02-18 20:43:20,991 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:183)
> - last event is received!
> 2016-02-18 20:43:20,991 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:184)
> - total count: 262144
> ...
> 2016-02-18 20:43:20,993 [main] INFO
> org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:191)
> - total evaluation time: 34 seconds for 262144 instances
> bash-4.1$ hadoop fs -ls /tmp/order_and_feats_flat_avro/2016_02_18 | more
> Found 70 items
> -rw-r--r-- 3 yarn hdfs 230855335 2016-02-18 16:01
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00001
> -rw-r--r-- 3 yarn hdfs 229800273 2016-02-18 16:04
> /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00002
> ...