Another recommendation from folks on this list was Apache Ignite. Its In-Memory File System (https://ignite.apache.org/features/igfs.html) looks quite interesting.
On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> OK, so it looks like Tachyon is a cluster memory plugin marked as "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on which may drive the decision for how to implement large-resource-file management.
>
> The system is a framework of N data analyzers which take incoming documents as input and transform them or extract data from them. These analyzers can be chained together, which makes this a great case for processing with RDDs and a set of map/filter-style Spark functions. There's already an established framework API which we want to preserve. This means that, most likely, we'll create a relatively thin "binding" layer that exposes these analyzers as well-documented functions to the end users who want to use them in a Spark-based distributed computing environment.
>
> We also want to, ideally, hide the complexity of how these resources are loaded from the end users who will be writing the actual Spark jobs that utilize the Spark "binding" functions we provide.
>
> So, for managing large numbers of small, medium, or large resource files, we're considering the options below, each with a variety of pros and cons, from the following perspectives:
>
> a) persistence - where the resources reside initially;
> b) loading - the mechanics for loading these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Considering that we have scores if not hundreds of these resource files, maintaining that many broadcast variables seems like a complexity that's going to be hard to manage. We'd also need a translation layer between the broadcast variables and the internal API, which would want to "speak" InputStreams rather than broadcast variables.
>
> 2. Load resources into RDDs and perform joins against them from our incoming document-data RDDs, thus achieving the effect of a value lookup from the resources. While this seems like a very Spark-y way of doing things, the lookup mechanics seem quite non-trivial, especially because some of the resources aren't going to be pure dictionaries; they may be statistical models. Additionally, this forces us to adopt Spark's semantics for handling these resources, which would mean a potential rewrite of our internal product API. That would be a hard option to go with.
>
> 3. Pre-install all the needed resources on each of the worker nodes; retrieve the needed resources from the file system and load them into memory as needed. Ideally, the resources would only be installed once, on the Spark driver side; we'd want to avoid having to pre-install all these files on each node. However, we've done this as an exercise and the approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3, i.e. into some distributed persistent store, and load them into cluster memory from there as necessary. Presumably this could be a pluggable store with a common API exposed. Since our framework is an OEM-able product, we could plug and play with a variety of such persistent stores via Java's FileSystem/URL scheme handler APIs.
>
> 5. Implement a resource management server with a RESTful interface on top. Under the covers, this could be a wrapper on top of #4. Potentially unnecessary if we have a solid persistent-store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these resources. We've considered Tachyon (especially since it's pluggable into Spark), Redis, and the like. Ideally, I would think we'd want resources to be loaded into cluster memory as needed, and paged in/out on demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be.
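The on-demand LRU paging described in option 6 could be sketched, independent of any particular backing store, as a small per-process cache. This is a minimal illustration only; `ResourceCache` and its byte-size accounting are hypothetical names, not part of any Spark, Tachyon, or Redis API:

```python
from collections import OrderedDict

class ResourceCache:
    """A minimal LRU cache for resource blobs, keyed by resource name.

    Evicts least-recently-used entries once the total cached byte size
    exceeds max_bytes. The loader callable fetches a resource (e.g. from
    HDFS/S3) on a cache miss.
    """

    def __init__(self, loader, max_bytes):
        self._loader = loader
        self._max_bytes = max_bytes
        self._entries = OrderedDict()  # name -> blob bytes, in LRU order
        self._size = 0

    def get(self, name):
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return self._entries[name]
        blob = self._loader(name)            # page in on demand
        self._entries[name] = blob
        self._size += len(blob)
        # Page out least-recently-used entries, keeping at least the newest.
        while self._size > self._max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self._size -= len(evicted)
        return blob
```

A real implementation would also need thread safety and, for cross-node sharing, a distributed tier (which is what Tachyon or Redis would provide); this only illustrates the paging policy.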
> Any thoughts / recommendations would be appreciated.
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its "cluster memory" support?
>>
>> It seems that the practice I hear the most about is the idea of loading resources as RDDs and then doing joins against them to achieve the lookup effect.
>>
>> The other approach would be to load the resources into broadcast variables, but I've heard concerns about memory. Could we run out of memory if we load too much into broadcast vars? Is there any memory-to-disk / spill-to-disk capability for broadcast variables in Spark?
>>
>> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Yes, Tachyon can help with your use case. You can read and write to Tachyon via the filesystem API (http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark is also able to interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read input files from Tachyon and write output files to Tachyon.
>>>
>>> I hope that helps,
>>> Gene
>>>
>>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>
>>>> I'd guess that if the resources are broadcast, Spark would put them into Tachyon...
>>>>
>>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>>
>>>> Would it make sense to load them into Tachyon and read and broadcast them from there, since Tachyon is already part of the Spark stack?
>>>>
>>>> If so, I wonder if I could do that Tachyon read/write via a Spark API?
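As an aside, the join-based lookup pattern discussed above amounts to joining (key, document) pairs against (key, resourceValue) pairs. A plain-Python simulation of the pair-RDD inner-join semantics, with illustrative data (no Spark required; all names here are made up for the example):

```python
from collections import defaultdict

def pair_join(left, right):
    """Simulate Spark's pair-RDD join: for each key present in both
    datasets, emit (key, (left_value, right_value)) for every pairing.
    Keys missing from either side are dropped, as in an inner join."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

# Documents keyed by a term, joined against a dictionary resource.
docs = [("spark", "doc1"), ("tachyon", "doc2"), ("unknown", "doc3")]
dictionary = [("spark", "engine"), ("tachyon", "memory fs")]
print(pair_join(docs, dictionary))
# -> [('spark', ('doc1', 'engine')), ('tachyon', ('doc2', 'memory fs'))]
```

This is what makes the pattern awkward for non-dictionary resources such as statistical models: a model doesn't reduce to per-key lookups, so it doesn't fit the join shape.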
>>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>> One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver.
>>>>
>>>> Regards,
>>>> Sab
>>>>
>>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>>
>>>>> We have a bunch of Spark jobs deployed and a few large resource files, such as e.g. a dictionary for lookups or a statistical model.
>>>>>
>>>>> Right now, these are deployed as part of the Spark jobs, which will eventually make the mongo-jars too bloated for deployments.
>>>>>
>>>>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>>>>>
>>>>> Thanks.
>>>>
>>>> --
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
>>>> +++
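One common way to keep large resources out of the deployed job jars, in the spirit of options 3 and 4 above, is a lazily initialized per-process holder, so that each worker process loads a given resource from the shared store at most once. A minimal sketch, assuming a `load_from_store` callable standing in for whatever fetches the blob from HDFS/S3/Tachyon (all names hypothetical):

```python
import threading

_resources = {}          # per-process cache: resource name -> loaded object
_lock = threading.Lock()

def get_resource(name, load_from_store):
    """Return the named resource, loading it at most once per process.

    load_from_store(name) is whatever reads the blob from the shared
    persistent store (HDFS, S3, Tachyon, ...); subsequent calls in the
    same process reuse the cached object.
    """
    res = _resources.get(name)
    if res is None:
        with _lock:
            res = _resources.get(name)  # re-check under the lock
            if res is None:
                res = load_from_store(name)
                _resources[name] = res
    return res
```

In a Spark job the analogous pattern is a lazy singleton referenced from within map/filter functions, so the load happens once per executor rather than once per record or once per task.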