I was replying to Nikaash.

Sorry -- the list keeps rejecting replies because of the size, so I had
to remove the quoted content.

On Fri, Apr 29, 2016 at 9:05 AM, Khurrum Nasim <[email protected]>
wrote:

> Is that for me, Dmitriy?
>
>
>
> > On Apr 29, 2016, at 11:53 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> >
> > Can you please look into the Spark UI and write down how many splits the
> > job generates in the first stage of the pipeline, or anywhere else there's
> > significant variation in the number of splits, in both cases?
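> >
> > To check it programmatically instead of eyeballing the UI, something
> > like this should work (a sketch assuming the Spark bindings; `rowsRdd`
> > stands in for however the input RDD actually gets built, and I'm
> > assuming the checkpointed DRM exposes its backing RDD as `rdd`):
> >
> >   import org.apache.mahout.math.drm._
> >   import org.apache.mahout.sparkbindings._
> >
> >   // Wrap the raw (key, row-vector) RDD into a DRM; its partitioning
> >   // becomes the split count of the first stage of the pipeline.
> >   val drmA = drmWrap(rowsRdd)
> >   println(s"splits in first stage: ${drmA.rdd.partitions.length}")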
> >
> > The row similarity job is a very short pipeline (compared with what a
> > pipeline would normally be, on average), so only the re-splitting of the
> > first input is critical.
> >
> > The splitting along the products is adjusted by the optimizer
> > automatically to match the number of data segments observed on average in
> > the input(s). E.g., if you compute val C = A %*% B, and A has 500 elements
> > per split while B has 5000 elements per split, then C would have
> > approximately 5000 elements per split (the larger average, in binary
> > operator cases). That's approximately how it works.
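> >
> > In code, that example would look roughly like this (a sketch; the
> > matrices and partition counts are made up to mirror the numbers above):
> >
> >   // A: ~500 elements per split; B: ~5000 elements per split.
> >   val drmA = drmParallelize(a, numPartitions = 200)
> >   val drmB = drmParallelize(b, numPartitions = 20)
> >   // The optimizer sizes C's splits off the larger average observed in
> >   // the operands, i.e. ~5000 elements per split here.
> >   val drmC = drmA %*% drmB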
> >
> > However, the par() that has been added is messing with the initial
> > parallelism, which naturally affects the rest of the pipeline per the
> > above. I now doubt it was a good thing -- when I suggested that Pat try
> > this, I did not mean to put it _inside_ the algorithm itself, but rather
> > into the input preparation code in his particular case. However, I don't
> > think it will work in every case. The sweet-spot parallelism for
> > multiplication unfortunately depends on tons of factors -- network
> > bandwidth and hardware configuration -- so it is difficult to give a good
> > guess universally. More likely, for CLI-based prepackaged algorithms (I
> > don't use the CLI myself, but rather assemble pipelines in Scala via
> > scripting and Scala application code), initial-parallelization adjustment
> > options should probably be provided to the CLI.
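> >
> > I.e., something along these lines in the input preparation code rather
> > than inside the algorithm (a sketch; assumes the DSL's par() with its
> > min/exact/auto knobs, and `n` is whatever a CLI option would carry):
> >
> >   // Re-split the input once, up front; everything downstream then
> >   // inherits reasonable split sizes per the optimizer behavior above.
> >   val drmAIn = drmA.par(auto = true)  // or .par(min = n) / .par(exact = n)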
>
>
