I was replying to Nikaash. Sorry -- the list keeps rejecting replies because of the size; I had to remove the quoted content.
On Fri, Apr 29, 2016 at 9:05 AM, Khurrum Nasim <[email protected]> wrote:
> Is that for me Dimitry ?
>
> > On Apr 29, 2016, at 11:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Can you please look into the Spark UI and write down how many splits the job
> > generates in the first stage of the pipeline, or anywhere else there's
> > significant variation in # of splits in both cases?
> >
> > The row similarity is a very short pipeline (in comparison with what would
> > normally be average), so only the first input re-splitting is critical.
> >
> > The splitting of the products is adjusted by the optimizer automatically to
> > match the number of data segments observed on average in the input(s). E.g.,
> > if you compute val C = A %*% B and A has 500 elements per split and B has
> > 5000 elements per split, then C would have approximately 5000 elements per
> > split (the larger average, in binary-operator cases). That's approximately
> > how it works.
> >
> > However, the par() that has been added is messing with the initial parallelism,
> > which naturally affects the rest of the pipeline per the above. I now doubt it
> > was a good thing -- when I suggested Pat try this, I did not mean to put
> > it _inside_ the algorithm itself, but rather into the input-preparation code
> > in his particular case. However, I don't think it will work in every given
> > case. The sweet-spot parallelism for multiplication unfortunately depends on
> > tons of factors -- network bandwidth and hardware configuration -- so it is
> > difficult to give it a good guess universally. More likely, for CLI-based
> > prepackaged algorithms (I don't use the CLI, but rather assemble pipelines in
> > Scala via scripting and Scala application code), the initial parallelization
> > adjustment options should probably be provided to the CLI.
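The split-propagation rule Dmitriy describes can be sketched in plain Scala. This is only an illustration of the heuristic as stated in the thread (the product of a binary op inherits roughly the larger per-split element count of its inputs), not Mahout's actual optimizer code; the function name is made up for the example.

```scala
// Sketch of the heuristic described above: for C = A %*% B, the optimizer
// targets the larger of the two inputs' average elements-per-split.
// (Illustrative only -- not the real Mahout Samsara optimizer logic.)
def productElemsPerSplit(aElemsPerSplit: Long, bElemsPerSplit: Long): Long =
  math.max(aElemsPerSplit, bElemsPerSplit)

// The example from the thread: A at 500 elems/split, B at 5000 elems/split.
println(productElemsPerSplit(500L, 5000L))  // 5000
```

This is also why a par() call at the head of the pipeline propagates: once the first input's splitting changes, every downstream product's splitting is derived from it by this rule.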
