Like this?

sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speechRecognizer(urls))
Here, 24 would be the total number of cores that you have on all the workers.

Thanks
Best Regards

On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <opus...@gmail.com> wrote:
> Hello, I am writing a Spark application that uses speech recognition to
> transcribe a very large number of recordings.
>
> I need some help configuring Spark.
>
> My app is basically a transformation with no side effects: recording URL
> --> transcript. The input is a huge file with one URL per line, and the
> output is a huge file of transcripts.
>
> The speech recognizer is written in Java (Sphinx4), so it can be packaged
> as a JAR.
>
> The recognizer is very processor intensive, so you can't run too many on
> one machine -- perhaps one recognizer per core. The recognizer is also
> big -- maybe 1 GB. But most of the recognizer is an immutable acoustic and
> language model that can be shared with other instances of the recognizer.
>
> So I want to run about one recognizer per core on each machine in my
> cluster. I want all recognizers on one machine to run within the same JVM
> and share the same models.
>
> How does one configure Spark for this sort of application? How does one
> control how Spark deploys the stages of the process? Can someone point me
> to an appropriate doc or keywords I should Google?
>
> Thanks
> Peter
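For the "share the models across all recognizers in one JVM" part, a common pattern is to load the heavy immutable state once per JVM in a static singleton, so every partition/task running in that executor reuses it. A minimal plain-Java sketch of that idea (the SharedModels and transcribe bits are hypothetical placeholders, not real Sphinx4 API; in Spark you would call into this from foreachPartition or mapPartitions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the ~1 GB immutable acoustic/language models.
// Initialized once per JVM by class loading, so all recognizer tasks in the
// same JVM (e.g. one Spark executor) share this single instance.
class SharedModels {
    static final SharedModels INSTANCE = new SharedModels();
    private SharedModels() {
        // load acoustic + language models here (expensive, done once per JVM)
    }
    String transcribe(String url) {
        // placeholder for the real recognition call
        return "transcript-of-" + url;
    }
}

public class Transcriber {
    // Run about one recognizer task per core, all sharing SharedModels.INSTANCE.
    public static List<String> transcribeAll(List<String> urls) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<String>> futures = new ArrayList<>();
        for (String url : urls) {
            futures.add(pool.submit(() -> SharedModels.INSTANCE.transcribe(url)));
        }
        List<String> transcripts = new ArrayList<>();
        for (Future<String> f : futures) {
            transcripts.add(f.get());
        }
        pool.shutdown();
        return transcripts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transcribeAll(List.of("a.wav", "b.wav")));
    }
}
```

In a Spark deployment the same effect usually comes from setting executor cores to the machine's core count (so one JVM per machine handles all its cores) and letting the singleton lazy-load on first use inside each executor.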