Purav, look up the Singleton pattern, which is what you seem to be
describing.

The strategy you describe does not sound like a good idea, however. It
couples the "lookup" service rather strongly (and serially) to its
data-processing clients. This is usually, though not always, less robust
and less efficient.
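
For illustration only, a rough sketch of that pattern on each worker JVM
(the file path and tab-separated record layout below are placeholders, not
your actual data) could look like this:

    import scala.io.Source

    // Loaded at most once per JVM: Scala's lazy val initialisation is
    // synchronized under the hood, so concurrent tasks on the same slave
    // block until the first one has finished building the map.
    object LookupTable {
      lazy val table: Map[String, String] =
        Source.fromFile("/local/path/to/large_lookup_file.tsv")
          .getLines()
          .map { line =>
            val Array(k, v) = line.split("\t", 2)
            k -> v
          }
          .toMap
    }

    // Inside a job the object is resolved on the workers, not the driver:
    //   rdd.mapPartitions(iter => iter.map(k => k -> LookupTable.table.get(k)))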

Sent while mobile. Pls excuse typos etc.
On Jan 7, 2014 10:30 AM, "purav aggarwal" <puravaggarwal...@gmail.com>
wrote:

> Thanks.
> Broadcasting such huge entities does not seem like a feasible solution.
> Serialization/deserialization and network transfer seem to have a huge
> overhead for files this large.
>
> Before I consider moving to an external lookup service (as Christopher
> rightly suggested), I was wondering if I could make each slave load the
> large file into memory and do the lookup operations in parallel.
>
> *I am stuck on how to make each slave load the file just once and perform
> the lookups.*
>
> I tried using a hack where I check whether the object is initialised and,
> if not, initialise it. The problem is that with multiple threads running on
> a single slave, I need a global object (specific to the JVM on that slave)
> for the other threads to block on via "synchronized" while one of them
> loads the large file for me.
> Any suggestions on what that object, unique to that particular JVM, could
> be? Is SparkContext an option?
>
>
>
> On Thu, Dec 26, 2013 at 10:41 AM, Christopher Nguyen <c...@adatao.com>
> wrote:
>
> > Purav, depending on the access pattern you should also consider the
> > trade-offs of setting up a lookup service (using, e.g., memcached, egad!)
> > which may end up being more efficient overall.
> >
> > The general point is not to restrict yourself to only Spark APIs when
> > considering the overall architecture.
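> >
> > As a rough illustration only (the client library, host name, and types
> > below are assumptions for the sake of the example, not a recommendation
> > of a particular stack), a memcached-backed lookup driven from Spark might
> > look something like:
> >
> >     import java.net.InetSocketAddress
> >     import net.spy.memcached.MemcachedClient
> >     import org.apache.spark.rdd.RDD
> >
> >     // keys: an RDD of lookup keys; "lookup-host" is a placeholder address.
> >     def lookupAll(keys: RDD[String]): RDD[(String, Option[AnyRef])] =
> >       keys.mapPartitions { iter =>
> >         // Open one connection per partition rather than one per record.
> >         val client =
> >           new MemcachedClient(new InetSocketAddress("lookup-host", 11211))
> >         val results = iter.map(k => k -> Option(client.get(k))).toList
> >         client.shutdown()
> >         results.iterator
> >       }
> >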
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao <http://adatao.com>
> > linkedin.com/in/ctnguyen
> >
> >
> >
> > On Wed, Dec 25, 2013 at 7:32 PM, purav aggarwal
> > <puravaggarwal...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I have a large file (> 5 GB) which I need to use for lookups. Since each
> > > slave needs to perform the search operation on the hashmap (built out of
> > > the file) in parallel, I need to broadcast the file. I was wondering if
> > > broadcasting such a huge file is really a good idea. Do we have any
> > > benchmarks for broadcast variables? I am on a Standalone cluster and
> > > machine configuration is not a problem at the moment.
> > > Has anyone exploited broadcast to such an extent?
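> > >
> > > For concreteness, what I have in mind is roughly the following (the file
> > > path, tab-separated layout, and keysRdd below are placeholders):
> > >
> > >     import scala.io.Source
> > >
> > >     // Build the hashmap once on the driver (placeholder path and format).
> > >     val table: Map[String, String] =
> > >       Source.fromFile("/path/to/large_lookup_file.tsv").getLines()
> > >         .map(_.split("\t", 2)).collect { case Array(k, v) => k -> v }
> > >         .toMap
> > >
> > >     // Ship it to every slave once; sc is the SparkContext and keysRdd
> > >     // an RDD[String] of lookup keys.
> > >     val tableBc = sc.broadcast(table)
> > >     val joined  = keysRdd.map(key => key -> tableBc.value.get(key))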
> > >
> > > Thanks,
> > > Purav
> > >
> >
>
