No one has any ideas?

Is there some more information I should provide?

I am looking for ways to increase the parallelism among my workers. Currently
I see the number of simultaneous connections to Solr equal to the number of
workers. My number of partitions is about 2.5x the number of workers, and the
workers seem to be large enough to handle more than one task at a time.
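
For context, this is roughly how I believe task concurrency is controlled (my
understanding from the docs, so the values below are illustrative and the app
name is made up). With 4 CPUs per worker I was expecting up to 4 concurrent
tasks per executor:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the settings I believe govern concurrent tasks per executor
    // (standalone cluster assumed; values here are illustrative only).
    val conf = new SparkConf()
      .setAppName("solr-enrichment")            // hypothetical app name
      .set("spark.executor.cores", "4")         // max concurrent tasks per executor
      .set("spark.default.parallelism", "40")   // default partition count
    val sc = new SparkContext(conf)

Please correct me if I have misunderstood what these settings do.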

I am creating a single client per partition in my mapPartitions call. Not
sure if that is what is creating the bottleneck? Perhaps I should use a pool
of clients (or a shared client) instead?
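
In case it clarifies what I mean by a pool/shared client, this is the kind of
change I was considering: a single lazily-created client per executor JVM
instead of one per partition. Just a sketch (the SolrClientHolder object is my
own invention, and I am assuming the client is safe to share across tasks),
not something I have tested yet:

    import org.apache.solr.client.solrj.impl.HttpSolrClient

    // Hypothetical holder: the lazy val is created once per executor JVM
    // and reused by every partition processed in that JVM.
    object SolrClientHolder {
      lazy val solr: HttpSolrClient = {
        val client = new HttpSolrClient()   // same constructor as in my snippet below
        initializeSolrParameters(client)
        client
      }
    }

    def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
      val solr = SolrClientHolder.solr
      keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
    }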

Would really appreciate some pointers.

Thanks in advance for any help you can provide.

-sujit


On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

> Hello,
>
> I am trying to run a Spark job that hits an external webservice to get
> back some information. The cluster is 1 master + 4 workers, each worker has
> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
> and is accessed using code similar to that shown below.
>
> def getResults(keyValues: Iterator[(String, Array[String])]):
>     Iterator[(String, String)] = {
>   val solr = new HttpSolrClient()
>   initializeSolrParameters(solr)
>   keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
> }
>
> myRDD.repartition(10)
>      .mapPartitions(keyValues => getResults(keyValues))
>
>
> The mapPartitions call initializes a SolrJ client per partition and then
> hits Solr for each record in the partition via the getResults() call.
>
> I repartitioned in the hope that this would result in 10 clients hitting
> Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
> clients if I can). However, I counted the number of open connections using
> netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
> observed that Solr has a constant 4 connections (i.e., equal to the number
> of workers) over the lifetime of the run.
>
> My observation leads me to believe that each worker processes a single
> stream of work sequentially. However, from what I understand about how
> Spark works, each worker should be able to process a number of tasks in
> parallel, and repartition() is a hint for it to do so.
>
> Is there some SparkConf setting or environment variable I should set to
> increase parallelism on these workers, or should I just configure a cluster
> with multiple workers per machine? Or is there something I am doing wrong?
>
> Thank you in advance for any pointers you can provide.
>
> -sujit
>
>
