Hello, Stephen. I suggest thin client deployment as a second option alongside the existing integration that uses a client node.

> I’m thinking specifically about better support for Spark Streaming, where the
> lack of continuous query support in thin clients removes a significant
> optimisation option.

That's very interesting. Can you share your thoughts? What can be improved in the Spark integration?

On Mon, 22 Oct 2018 at 10:22 +0100, Stephen Darlington wrote:
> Are you suggesting making the Thin Client deployment an option or a
> replacement for the thick client? If the latter, do we risk making future
> desirable changes more difficult (or impossible)? I’m thinking specifically
> about better support for Spark Streaming, where the lack of continuous query
> support in thin clients removes a significant optimisation option. I’m sure
> there are other use cases.
>
> Regards,
> Stephen
>
> > On 21 Oct 2018, at 09:08, Nikolay Izhikov <nizhi...@apache.org> wrote:
> >
> > Valentin.
> >
> > It seems you made several assumptions which are not always true, from my
> > point of view:
> >
> > 1. "We have access to the Spark cluster installation to perform deployment
> > steps" - this is not true in a cloud or enterprise environment.
> >
> > 2. "The Spark cluster is used only for the Ignite integration."
> > From what I know, the computational resources of a big Spark cluster are
> > shared by many business divisions, and it is not convenient to perform
> > deployment steps on such a cluster.
> >
> > 3. "When Ignite + Spark are used in real production, it's OK to have a
> > reasonable deployment overhead."
> > What about a developer who wants to play with this integration,
> > and wants to do it quickly to see how it works on real-life examples?
> > Can we make his life much easier?
> >
> > > First of all, they will exist with the thin client too.
> >
> > Spark has the ability to deploy jars on workers and add them to the
> > application tasks' classpath.
> > For 2.6 we must deploy 11 additional jars to start using Ignite.
> > Please, see my example at the bottom of the documentation page [1].
> >
> > Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies
> > of the Ignite integration to you? And to our users? :)
> >
> > Actually, the list of dependencies will change in 2.7 - a new version of
> > JCache, a new version of H2 - so users will have to change it in code or
> > perform additional deployment steps.
> >
> > That is overkill for me.
> >
> > On the other hand, the thin client requires only 1 jar.
> > Moreover, the thin client protocol is backward compatible,
> > so the thin client will work correctly when the Ignite cluster is updated
> > from 2.6 to 2.7.
> > So, with a Spark integration via the thin client, we will be able to update
> > the Ignite cluster and the Spark integration separately.
> > For now, we have to do it in one big step.
> >
> > What do you think?
> >
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> >
> > On Sat, 20 Oct 2018 at 18:33 -0700, Valentin Kulichenko wrote:
> > > Guys,
> > >
> > > From my experience, Ignite and Spark clusters typically run in the same
> > > environment, which makes a client node the more preferable option, mainly
> > > because of performance. BTW, I doubt partition awareness in the thin client
> > > will help either, because in data frames we only run SQL queries and I
> > > believe the thin client will execute them through a proxy anyway. But
> > > correct me if I’m wrong.
> > >
> > > Either way, it sounds like we just have usability issues with the
> > > Ignite/Spark integration. Why don’t we concentrate on fixing them then?
> > > For example, #3 can be fixed by loading the XML content on the master and
> > > then distributing it to the workers, instead of loading it on every worker
> > > independently. Then there are certain procedures like deploying JARs, etc.
> > > First of all, they will exist with the thin client too. Second of all, I’m
> > > sure there are ways to simplify these procedures and make the integration
> > > easier. My opinion is that working on such improvements is going to add
> > > more value than another implementation based on the thin client.
> > >
> > > -Val
> > >
> > > On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dma...@apache.org> wrote:
> > >
> > > > Hello Nikolay,
> > > >
> > > > Your proposal sounds reasonable. However, I would suggest that we wait
> > > > until partition awareness is supported in the Java thin client first.
> > > > With that feature, the client can connect to any node directly, while
> > > > presently all the communication goes through a proxy (the node the
> > > > client is connected to). All of that is bad for performance.
> > > >
> > > > Vladimir, how hard would it be to support partition awareness in the
> > > > Java client? Probably Nikolay can take it over.
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhi...@apache.org>
> > > > wrote:
> > > >
> > > > > Hello, Igniters.
> > > > >
> > > > > Currently, the Spark Data Frame integration is implemented via a
> > > > > client node connection. Whenever we need to retrieve some data into a
> > > > > Spark worker (or master) from Ignite, we start a client node.
> > > > >
> > > > > It has several major disadvantages:
> > > > >
> > > > > 1. We have to copy the whole Ignite distribution onto each Spark
> > > > > worker [1].
> > > > > 2. We have to copy the whole Ignite distribution onto the Spark
> > > > > master to make the catalogue work.
> > > > > 3. We have to have the same absolute path to the Ignite configuration
> > > > > file on every worker and provide it during data frame construction [2].
> > > > > 4. We have to additionally configure the Spark workers' classpath to
> > > > > include the Ignite libraries.
> > > > >
> > > > > For now, almost all the operations we need in the Spark Data Frame
> > > > > integration are supported by the Java Thin Client:
> > > > > * obtain the list of caches;
> > > > > * get a cache configuration;
> > > > > * execute a SQL query;
> > > > > * stream data to a table - not supported by the thin client for now,
> > > > > but it can be implemented using simple SQL INSERT statements.
> > > > >
> > > > > Advantages of using the Java Thin Client in the Spark integration
> > > > > (they are all known Java Thin Client advantages):
> > > > > 1. Easy to configure: only the IP addresses of the server nodes are
> > > > > required.
> > > > > 2. Easy to deploy: only 1 additional jar is required. No server-side
> > > > > (Ignite worker) configuration is required.
> > > > >
> > > > > I propose to implement the Spark Data Frame integration through the
> > > > > Java Thin Client.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > > > [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
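[Editor's note] The four operations the thread says the integration needs can all be sketched against the Java Thin Client API. This is a minimal illustration, not the actual Spark integration code: it assumes a running Ignite cluster listening on the default thin-client port 10800, and a hypothetical SQL table/cache named `Person`. It will not run without such a cluster.

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientSketch {
    public static void main(String[] args) throws Exception {
        // "Easy to configure": only server addresses are needed;
        // no local Ignite node (and no local Ignite distribution) is started.
        ClientConfiguration cfg = new ClientConfiguration()
            .setAddresses("127.0.0.1:10800");

        try (IgniteClient client = Ignition.startClient(cfg)) {
            // 1. Obtain the list of caches (what the Spark catalogue needs).
            System.out.println(client.cacheNames());

            // 2. Get a cache configuration.
            ClientCache<?, ?> cache = client.cache("SQL_PUBLIC_PERSON");
            System.out.println(cache.getConfiguration().getName());

            // 3. Execute a SQL query (what the Data Frame reader needs).
            client.query(new SqlFieldsQuery("SELECT id, name FROM Person"))
                  .getAll()
                  .forEach(System.out::println);

            // 4. "Stream" data with plain SQL INSERTs, as the proposal
            //    suggests, until native streaming exists in the thin client.
            client.query(new SqlFieldsQuery(
                    "INSERT INTO Person(id, name) VALUES(?, ?)")
                    .setArgs(1L, "John")).getAll();
        }
    }
}
```

The cache name `SQL_PUBLIC_PERSON` follows Ignite's convention for caches backing SQL tables created via DDL, but the exact name depends on how the table was created.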
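[Editor's note] For contrast, the deployment overhead of the current client-node approach versus the proposed thin-client one could look roughly like the following. This is a hypothetical sketch: the paths, the full jar list, and the exact Spark options are illustrative (only cache-api-1.0.0.jar and h2-1.4.195.jar are named in the thread).

```shell
# Today (client node): every worker needs the Ignite libs on its classpath
# and the same absolute path to the XML config file.
spark-submit \
  --jars ignite-core.jar,ignite-spark.jar,ignite-spring.jar,cache-api-1.0.0.jar,h2-1.4.195.jar,... \
  --conf spark.executor.extraClassPath=/opt/ignite/libs/* \
  my-app.jar /same/absolute/path/ignite-config.xml

# Proposed (thin client): one jar, plus server node addresses.
spark-submit \
  --jars ignite-core.jar \
  my-app.jar 10.0.0.1:10800,10.0.0.2:10800
```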