Another recommendation from folks on this list was Apache Ignite. Its In-Memory File System (https://ignite.apache.org/features/igfs.html) looks quite interesting.
On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> OK, so it looks like Tachyon is a cluster memory plugin marked as "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on which may drive the decision for how to implement large-resource-file management.
>
> The system is a framework of N data analyzers which take incoming documents as input and transform them or extract data from them. These analyzers can be chained together, which makes this a great case for processing with RDDs and a set of map/filter-style Spark functions. There's already an established framework API which we want to preserve. This means that, most likely, we'll create a relatively thin "binding" layer that exposes these analyzers as well-documented functions to the end users who want to use them in a Spark-based distributed computing environment.
>
> We also want to, ideally, hide the complexity of how these resources are loaded from the end users who will be writing the actual Spark jobs that utilize the Spark "binding" functions we provide.
>
> So, for managing large numbers of small, medium, or large resource files, we're considering the options below, each with a variety of pros and cons, from the following perspectives:
>
> a) persistence - where the resources reside initially;
> b) loading - the mechanics for loading these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Considering that we have scores if not hundreds of these resource files, maintaining that many broadcast variables seems like a complexity that's going to be hard to manage. We'd also need a translation layer between the broadcast variables and the internal API, which would want to "speak" InputStreams rather than broadcast variables.
>
> 2. Load resources into RDDs and perform joins against them from our incoming document-data RDDs, thus achieving the effect of a value lookup from the resources. While this seems like a very Spark-y way of doing things, the lookup mechanics seem quite non-trivial, especially because some of the resources aren't going to be pure dictionaries; they may be statistical models. Additionally, this forces us to adopt Spark's semantics for handling these resources, which would mean a potential rewrite of our internal product API. That would be a hard option to go with.
>
> 3. Pre-install all the needed resources on each of the worker nodes; retrieve the needed resources from the file system and load them into memory as needed. Ideally, the resources would only be installed once, on the Spark driver side; we'd want to avoid having to pre-install all these files on each node. However, we've done this as an exercise and the approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3, i.e. into some distributed persistent store, and load them into cluster memory from there as necessary. Presumably this could be a pluggable store with a common API exposed. Since our framework is an OEM-able product, we could plug and play with a variety of such persistent stores via Java's FileSystem/URL scheme handler APIs.
>
> 5. Implement a resource management server with a RESTful interface on top. Under the covers, this could be a wrapper on top of #4. Potentially unnecessary if we have a solid persistent-store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these resources. We've considered Tachyon (especially since it's pluggable into Spark), Redis, and the like. Ideally, I would think we'd want resources to be loaded into cluster memory as needed, and paged in/out on demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be.
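The on-demand LRU paging described in option 6 could be sketched, independent of any particular backing store, as a small per-process cache. This is a minimal illustration only; `ResourceCache` and its byte-size accounting are hypothetical names, not part of any Spark, Tachyon, or Redis API:

```python
from collections import OrderedDict

class ResourceCache:
    """A minimal LRU cache for resource blobs, keyed by resource name.

    Evicts least-recently-used entries once the total cached byte size
    exceeds max_bytes. The loader callable fetches a resource (e.g. from
    HDFS/S3) on a cache miss.
    """

    def __init__(self, loader, max_bytes):
        self._loader = loader
        self._max_bytes = max_bytes
        self._entries = OrderedDict()  # name -> blob bytes, in LRU order
        self._size = 0

    def get(self, name):
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return self._entries[name]
        blob = self._loader(name)            # page in on demand
        self._entries[name] = blob
        self._size += len(blob)
        # Page out least-recently-used entries, keeping at least the newest.
        while self._size > self._max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self._size -= len(evicted)
        return blob
```

A real implementation would also need thread safety and, for cross-node sharing, a distributed tier (which is what Tachyon or Redis would provide); this only illustrates the paging policy.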
> Any thoughts / recommendations would be appreciated.
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its "cluster memory" support?
>>
>> It seems that the practice I hear the most about is the idea of loading resources as RDDs and then doing joins against them to achieve the lookup effect.
>>
>> The other approach would be to load the resources into broadcast variables, but I've heard concerns about memory. Could we run out of memory if we load too much into broadcast vars? Is there any memory-to-disk / spill-to-disk capability for broadcast variables in Spark?
>>
>> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Yes, Tachyon can help with your use case. You can read and write to Tachyon via the filesystem API (http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark is also able to interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read input files from Tachyon and write output files to Tachyon.
>>>
>>> I hope that helps,
>>> Gene
>>>
>>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>
>>>> I'd guess that if the resources are broadcast, Spark would put them into Tachyon...
>>>>
>>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>>
>>>> Would it make sense to load them into Tachyon and read and broadcast them from there, since Tachyon is already part of the Spark stack?
>>>>
>>>> If so, I wonder if I could do that Tachyon read/write via a Spark API?
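As an aside, the join-based lookup pattern discussed above amounts to joining (key, document) pairs against (key, resourceValue) pairs. A plain-Python simulation of the pair-RDD inner-join semantics, with illustrative data (no Spark required; all names here are made up for the example):

```python
from collections import defaultdict

def pair_join(left, right):
    """Simulate Spark's pair-RDD join: for each key present in both
    datasets, emit (key, (left_value, right_value)) for every pairing.
    Keys missing from either side are dropped, as in an inner join."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

# Documents keyed by a term, joined against a dictionary resource.
docs = [("spark", "doc1"), ("tachyon", "doc2"), ("unknown", "doc3")]
dictionary = [("spark", "engine"), ("tachyon", "memory fs")]
print(pair_join(docs, dictionary))
# -> [('spark', ('doc1', 'engine')), ('tachyon', ('doc2', 'memory fs'))]
```

This is what makes the pattern awkward for non-dictionary resources such as statistical models: a model doesn't reduce to per-key lookups, so it doesn't fit the join shape.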
>>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>> One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver.
>>>>
>>>> Regards,
>>>> Sab
>>>>
>>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>>>>
>>>>> We have a bunch of Spark jobs deployed and a few large resource files, such as e.g. a dictionary for lookups or a statistical model.
>>>>>
>>>>> Right now, these are deployed as part of the Spark jobs, which will eventually make the mongo-jars too bloated for deployments.
>>>>>
>>>>> What are some of the best practices to consider for maintaining and sharing large resource files like these?
>>>>>
>>>>> Thanks.
>>>>
>>>> --
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
>>>> +++
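One common way to keep large resources out of the deployed job jars, in the spirit of options 3 and 4 above, is a lazily initialized per-process holder, so that each worker process loads a given resource from the shared store at most once. A minimal sketch, assuming a `load_from_store` callable standing in for whatever fetches the blob from HDFS/S3/Tachyon (all names hypothetical):

```python
import threading

_resources = {}          # per-process cache: resource name -> loaded object
_lock = threading.Lock()

def get_resource(name, load_from_store):
    """Return the named resource, loading it at most once per process.

    load_from_store(name) is whatever reads the blob from the shared
    persistent store (HDFS, S3, Tachyon, ...); subsequent calls in the
    same process reuse the cached object.
    """
    res = _resources.get(name)
    if res is None:
        with _lock:
            res = _resources.get(name)  # re-check under the lock
            if res is None:
                res = load_from_store(name)
                _resources[name] = res
    return res
```

In a Spark job the analogous pattern is a lazy singleton referenced from within map/filter functions, so the load happens once per executor rather than once per record or once per task.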