Like this?

sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
speechRecognizer(urls))

Here 24 is the minimum number of partitions; set it to the total number of
cores that you have across all the workers.
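
To share the models within one JVM, a rough sketch (not tested against
Sphinx4) is to hold them in a Scala singleton object, which is initialized
once per executor JVM, and to read it from inside mapPartitions. Here Models,
SharedModels, TranscribeJob and the output path are hypothetical names, and
the transcribe method is only a placeholder for the real recognizer call. It
assumes one executor per machine, with its cores (spark.executor.cores) set
to the number of recognizers you want running at once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the immutable acoustic and language models.
// A real version would load the Sphinx4 models here.
class Models {
  // Placeholder: a real implementation would fetch the recording and decode it.
  def transcribe(url: String): String = s"transcript of $url"
}

// A Scala object is initialized once per JVM, so all tasks running in the
// same executor share this single Models instance. Initialization of a
// lazy val is thread-safe.
object SharedModels {
  lazy val models: Models = new Models
}

object TranscribeJob {
  // Called once per partition; the models are loaded on first use in each JVM.
  def transcribePartition(urls: Iterator[String]): Iterator[String] = {
    val models = SharedModels.models
    urls.map(models.transcribe)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transcribe"))
    sc.textFile("/sigmoid/audio/data/", 24)
      .mapPartitions(transcribePartition)
      .saveAsTextFile("/sigmoid/audio/transcripts/") // hypothetical output path
    sc.stop()
  }
}
```

Because the models are reached through the object rather than captured in the
closure, they are never serialized or shipped from the driver; each executor
loads its own copy once.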

Thanks
Best Regards

On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <opus...@gmail.com> wrote:

> Hello, I am writing a Spark application to use speech recognition to
> transcribe a very large number of recordings.
>
> I need some help configuring Spark.
>
> My app is basically a transformation with no side effects: recording URL
> --> transcript.  The input is a huge file with one URL per line, and the
> output is a huge file of transcripts.
>
> The speech recognizer is written in Java (Sphinx4), so it can be packaged
> as a JAR.
>
> The recognizer is very processor intensive, so you can't run too many on
> one machine-- perhaps one recognizer per core.  The recognizer is also
> big-- maybe 1 GB.  But most of the recognizer is immutable acoustic and
> language models that can be shared with other instances of the recognizer.
>
> So I want to run about one recognizer per core of each machine in my
> cluster.  I want all recognizers on one machine to run within the same JVM
> and share the same models.
>
> How does one configure Spark for this sort of application?  How does one
> control how Spark deploys the stages of the process?  Can someone point me
> to an appropriate doc or keywords I should Google?
>
> Thanks
> Peter
>
