Re: [DISCUSSION] Spark Data Frame through Thin Client

Nikolay Izhikov Wed, 24 Oct 2018 02:49:40 -0700

Hello, Valentin.

> What I don't agree with is that replacing thick client with thin client is a 
> way to fix usability issues.


I think it will fix some of them.

> will potentially compromise the performance

As I mentioned earlier, I want to provide an easy way to try out the integration.
For maximum performance one should use client nodes.

> What is the difference between thin and thick client from this point of view?

We need only 1 jar file.
The only config option we need is a list of IP addresses.
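
To illustrate how small that configuration is, here is a minimal Python sketch of parsing such an address list. The function name `parse_addresses` and the comma-separated option format are assumptions made for illustration, not an existing Ignite or Spark API; 10800 is the default thin client port. IPv6 literals are not handled.

```python
DEFAULT_THIN_CLIENT_PORT = 10800  # Ignite's default thin client port


def parse_addresses(option):
    """Turn 'host1:port1,host2' into a list of (host, port) tuples.

    Hypothetical helper: the option format is an assumption, not an Ignite API.
    """
    endpoints = []
    for item in option.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = item.partition(":")
        # Fall back to the default port when no ':port' suffix is given.
        endpoints.append((host, int(port) if sep else DEFAULT_THIN_CLIENT_PORT))
    return endpoints


print(parse_addresses("10.0.0.1:10800, 10.0.0.2"))
```

Nothing else (no XML file path, no classpath setup) would have to be passed to the workers in this scheme.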

> I'm not arguing there are usability issues with thick client. 
> I'm just suggesting to fix those issues first, before we jump reworking the 
> implementation.

> My suggestion is to look at usability issues and try to fix them without 
> getting rid of thick client.

I agree, let's do it!
Can you create some tickets?
I'm ready to look at it and contribute a fix.

On Tue, 23/10/2018 at 19:31 -0700, Valentin Kulichenko wrote:
> Nikolay,
> 
> Please see my comments below. Actually, I haven't made most of the
> assumptions that you mentioned, and I generally agree with you. What I
> don't agree with is that replacing thick client with thin client is a way
> to fix usability issues. Thin client is not going to be issue-free either,
> but will potentially compromise the performance, as well as functionality
> (like streaming, as Stephen mentioned). My suggestion is to look at
> usability issues and try to fix them without getting rid of thick client.
> 
> -Val
> 
> On Sun, Oct 21, 2018 at 1:08 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> 
> > Valentin.
> > 
> > It seems you made several assumptions that, from my point of view, are
> > not always true:
> > 
> > 1. "We have access to Spark cluster installation to perform deployment
> > steps" - this is not true in a cloud or enterprise environment.
> > 
> 
> Can you please elaborate on this? What is the difference between thin and
> thick client from this point of view? I understand that the latter would
> generally be more complicated, but how would one use thin client without
> deploying a JAR?
> 
> 
> > 
> > 2. "Spark cluster is used only for Ignite integration".
> > From what I know, the computational resources of a big Spark cluster are
> > shared among many business divisions.
> > And it is not convenient to perform deployment steps on such a cluster.
> > 
> 
> Same as #1. Regardless of how we use the Spark cluster, we need to deploy a
> JAR in case of thin client, no?
> 
> 
> > 
> > 3. "When Ignite + Spark are used in real production it's OK to have
> > reasonable deployment overhead"
> > What about a developer who wants to play with this integration?
> > And wants to do it quickly, to see how it works on real-life examples.
> > Can we make their life much easier?
> > 
> 
> We can and we should :) I'm not arguing there are usability issues with
> thick client. I'm just suggesting to fix those issues first, before we jump
> reworking the implementation.
> 
> 
> > 
> > > First of all, they will exist with thin client either.
> > 
> > Spark has the ability to deploy jars to workers and add them to the
> > application tasks' classpath.
> > For 2.6 we must deploy 11 additional jars to start using Ignite.
> > Please see my example at the bottom of the documentation page [1]
> > 
> > Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious
> > dependencies for Ignite integration to you?
> > And to our users? :)
> > 
> 
> No, this is not obvious. Absolutely, this is a usability issue and we
> should think how to make user's life easier.
> 
> 
> > 
> > Actually, the list of dependencies will change in 2.7: a new version of
> > jcache and a new version of h2.
> > So the user has to change it in code or perform additional deployment steps.
> > 
> > It is overkill for me.
> > 
> > On the other hand, the thin client requires only 1 jar.
> > Moreover, the thin client protocol is backward compatible.
> > So the thin client will keep working correctly when the Ignite cluster is
> > updated from 2.6 to 2.7.
> > So, with Spark integration via the thin client we will be able to update
> > the Ignite cluster and the Spark integration separately.
> > For now, we have to do it in one big step.
> > 
> > What do you think?
> > 
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > 
> > On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> > > Guys,
> > > 
> > > From my experience, Ignite and Spark clusters typically run in the same
> > > environment, which makes client node a more preferable option. Mainly,
> > > because of performance. BTW, I doubt partition-awareness on thin client
> > > will help either, because in dataframes we only run SQL queries and I
> > > believe thin client will execute them through a proxy anyway. But correct
> > > me if I’m wrong.
> > > 
> > > Either way, it sounds like we just have usability issues with
> > > Ignite/Spark integration. Why don’t we concentrate on fixing them then?
> > > For example, #3 can be fixed by loading XML content on master and then
> > > distributing it to workers, instead of loading on every worker
> > > independently. Then there are certain procedures like deploying JARs,
> > > etc. First of all, they will exist with thin client either. Second of
> > > all, I’m sure there are ways to simplify these procedures and make
> > > integration easier. My opinion is that working on such improvements is
> > > going to add more value than another implementation based on thin client.
> > > 
> > > -Val
> > > 
> > > On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dma...@apache.org> wrote:
> > > 
> > > > Hello Nikolay,
> > > > 
> > > > Your proposal sounds reasonable. However, I would suggest that we wait
> > > > until partition-awareness is supported for the Java thin client first.
> > > > With that feature, the client can connect to any node directly, while
> > > > presently all the communication goes through a proxy (a node the client
> > > > is connected to). All of that is bad for performance.
> > > > 
> > > > 
> > > > Vladimir, how hard would it be to support the partition-awareness for
> > > > Java client? Probably, Nikolay can take over.
> > > > 
> > > > --
> > > > Denis
> > > > 
> > > > 
> > > > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhi...@apache.org>
> > > > wrote:
> > > > 
> > > > > Hello, Igniters.
> > > > > 
> > > > > Currently, the Spark Data Frame integration is implemented via a
> > > > > client node connection.
> > > > > Whenever we need to retrieve some data from Ignite into a Spark worker
> > > > > (or master), we start a client node.
> > > > > 
> > > > > It has several major disadvantages:
> > > > > 
> > > > >         1. We have to copy the whole Ignite distribution onto each
> > > > >            Spark worker [1].
> > > > >         2. We have to copy the whole Ignite distribution onto the Spark
> > > > >            master to make the catalog work.
> > > > >         3. We must have the same absolute path to the Ignite
> > > > >            configuration file on every worker and provide it during
> > > > >            data frame construction [2].
> > > > >         4. We have to additionally configure the Spark workers'
> > > > >            classpath to include the Ignite libraries.
> > > > > 
> > > > > For now, almost all the operations we need in the Spark Data Frame
> > > > > integration are supported by the Java Thin Client:
> > > > >         * obtain the list of caches.
> > > > >         * get cache configuration.
> > > > >         * execute SQL query.
> > > > >         * stream data to the table - not supported by the thin client
> > > > >           for now, but can be implemented using simple SQL INSERT
> > > > >           statements.
> > > > > 
> > > > > Advantages of using the Java Thin Client in the Spark integration (they
> > > > > all follow from the known Java Thin Client advantages):
> > > > >         1. Easy to configure: only IP addresses of server nodes are
> > > > >            required.
> > > > >         2. Easy to deploy: only 1 additional jar required. No
> > > > >            server-side (Ignite worker) configuration required.
> > > > > 
> > > > > I propose to implement Spark Data Frame integration through Java Thin
> > > > > Client.
> > > > > 
> > > > > Thoughts?
> > > > > 
> > > > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > > > [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> > > > >
