Re: [DISCUSSION] Spark Data Frame through Thin Client

Stephen Darlington Mon, 22 Oct 2018 03:17:15 -0700

Ignite doesn’t currently support Spark Structured Streaming:

https://issues.apache.org/jira/browse/IGNITE-9357 


There’s a working patch associated with it.

Regards,
Stephen

> On 22 Oct 2018, at 10:43, Nikolay Izhikov <nizhi...@apache.org> wrote:
> 
> Hello, Stephen.
> 
> I suggest thin client deployment as a second option, alongside the existing 
> integration that uses a client node.
> 
>> I’m thinking specifically about better support for Spark Streaming, where 
>> the lack of continuous query support in thin clients removes a significant 
>> optimisation option. 
> 
> It's very interesting.
> Can you share your thoughts?
> What can be improved in Spark integration?
> 
> On Mon, 22/10/2018 at 10:22 +0100, Stephen Darlington wrote:
>> Are you suggesting making the thin client deployment an option, or a 
>> replacement for the thick client? If the latter, do we risk making future 
>> desirable changes more difficult (or impossible)? I’m thinking specifically 
>> about better support for Spark Streaming, where the lack of continuous 
>> query support in thin clients removes a significant optimisation option. I’m 
>> sure there are other use cases.
>> 
>> Regards,
>> Stephen
>> 
>>> On 21 Oct 2018, at 09:08, Nikolay Izhikov <nizhi...@apache.org> wrote:
>>> 
>>> Valentin.
>>> 
>>> It seems you made several assumptions that, from my point of view, are not 
>>> always true:
>>> 
>>> 1. "We have access to Spark cluster installation to perform deployment 
>>> steps" - this is not true in a cloud or enterprise environment.
>>> 
>>> 2. "Spark cluster is used only for Ignite integration".
>>> From what I know, the computational resources of a big Spark cluster are 
>>> shared among many business divisions.
>>> And it is not convenient to perform deployment steps on such a cluster.
>>> 
>>> 3. "When Ignite + Spark are used in real production it's OK to have 
>>> reasonable deployment overhead"
>>> What about a developer who wants to play with this integration,
>>> and wants to do it quickly to see how it works on real-life examples?
>>> Can we make their life much easier?
>>> 
>>>> First of all, they will exist with the thin client as well.
>>> 
>>> Spark has the ability to deploy jars to workers and add them to the 
>>> application task classpath.
>>> For 2.6 we must deploy 11 additional jars to start using Ignite.
>>> Please see my example at the bottom of the documentation page [1]
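
To make the deployment burden concrete, here is a minimal Python sketch of how such a jar list might be assembled into the comma-separated value that Spark's `spark.jars` property takes. The directory `/opt/ignite/libs` and the helper name `spark_jars_value` are hypothetical, and only the two jars named in this thread are listed, not the full set of 11.

```python
import posixpath
from typing import Iterable


def spark_jars_value(lib_dir: str, jar_names: Iterable[str]) -> str:
    """Join jar paths into a comma-separated string for spark.jars."""
    return ",".join(posixpath.join(lib_dir, name) for name in jar_names)


# Hypothetical install directory; the jar names are the two mentioned in the
# thread. The real list depends on the Ignite version in use.
jars = spark_jars_value(
    "/opt/ignite/libs",
    ["cache-api-1.0.0.jar", "h2-1.4.195.jar"],
)
print(jars)
```

Because the file names carry version numbers, any change in the dependency list means editing this value.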
>>> 
>>> Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies 
>>> of the Ignite integration to you?
>>> And for our users? :)
>>> 
>>> Actually, the list of dependencies will change in 2.7: a new version of 
>>> jcache, a new version of h2.
>>> So the user has to change it in code or perform additional deployment steps.
>>> 
>>> It's overkill for me.
>>> 
>>> On the other hand, the thin client requires only 1 jar.
>>> Moreover, the thin client protocol is backward compatible.
>>> So the thin client will keep working correctly when the Ignite cluster is 
>>> updated from 2.6 to 2.7.
>>> So, with Spark integration via the thin client we will be able to update 
>>> the Ignite cluster and the Spark integration separately.
>>> For now, we have to do it in one big step.
>>> 
>>> What do you think?
>>> 
>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>> 
>>> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
>>>> Guys,
>>>> 
>>>> From my experience, Ignite and Spark clusters typically run in the same
>>>> environment, which makes a client node the preferable option, mainly
>>>> because of performance. BTW, I doubt partition-awareness on the thin client
>>>> will help either, because in dataframes we only run SQL queries and I
>>>> believe the thin client will execute them through a proxy anyway. But
>>>> correct me if I’m wrong.
>>>> 
>>>> Either way, it sounds like we just have usability issues with Ignite/Spark
>>>> integration. Why don’t we concentrate on fixing them then? For example, #3
>>>> can be fixed by loading XML content on master and then distributing it to
>>>> workers, instead of loading on every worker independently. Then there are
>>>> certain procedures like deploying JARs, etc. First of all, they will exist
>>>> with the thin client as well. Second of all, I’m sure there are ways to
>>>> simplify these procedures and make integration easier. My opinion is that working on
>>>> such improvements is going to add more value than another implementation
>>>> based on thin client.
>>>> 
>>>> -Val
>>>> 
>>>> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dma...@apache.org> wrote:
>>>> 
>>>>> Hello Nikolay,
>>>>> 
>>>>> Your proposal sounds reasonable. However, I would suggest that we wait
>>>>> until partition-awareness is supported for the Java thin client first.
>>>>> With that feature, the client can connect to any node directly, while
>>>>> presently all the communication goes through a proxy (a node the client
>>>>> is connected to).
>>>>> Going through a proxy is bad for performance.
>>>>> 
>>>>> 
>>>>> Vladimir, how hard would it be to support partition-awareness for the Java
>>>>> client? Perhaps Nikolay can take it over.
>>>>> 
>>>>> --
>>>>> Denis
>>>>> 
>>>>> 
>>>>> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhi...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Hello, Igniters.
>>>>>> 
>>>>>> Currently, the Spark Data Frame integration is implemented via a client
>>>>>> node connection.
>>>>>> Whenever we need to retrieve some data into a Spark worker (or master)
>>>>>> from Ignite, we start a client node.
>>>>>> 
>>>>>> It has several major disadvantages:
>>>>>> 
>>>>>>       1. We have to copy the whole Ignite distribution onto each Spark
>>>>>> worker [1]
>>>>>>       2. We have to copy the whole Ignite distribution onto the Spark
>>>>>> master to get the catalogue to work.
>>>>>>       3. We have to use the same absolute path to the Ignite configuration
>>>>>> file on every worker and provide it during data frame construction [2]
>>>>>>       4. We have to additionally configure the Spark workers' classpath to
>>>>>> include Ignite libraries.
>>>>>> 
>>>>>> For now, almost all operations we need in the Spark Data Frame
>>>>>> integration are supported by the Java Thin Client:
>>>>>>       * obtain the list of caches.
>>>>>>       * get cache configuration.
>>>>>>       * execute SQL query.
>>>>>>       * stream data to the table - not supported by the thin client for
>>>>>> now, but can be implemented using simple SQL INSERT statements.
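
The last bullet could be approximated along these lines. This is a minimal Python sketch, assuming the SQL dialect accepts multi-row VALUES lists and `?` positional parameters; the helper name `batched_inserts`, the table `person`, and its columns are hypothetical, and the resulting statements would still have to be executed through whatever SQL API the client exposes.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def batched_inserts(
    table: str,
    columns: Sequence[str],
    rows: Iterable[Sequence[object]],
    batch_size: int = 2,
) -> Iterator[Tuple[str, List[object]]]:
    """Yield (sql, args) pairs, one parameterised multi-row INSERT per batch.

    Table and column names are interpolated as-is, so they must come from
    trusted input; only the row values are passed as parameters.
    """
    it = iter(rows)
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table,
            ", ".join(columns),
            ", ".join(row_placeholder for _ in batch),
        )
        # Flatten the batch so arguments line up with the placeholders.
        args = [value for row in batch for value in row]
        yield sql, args


# Hypothetical table and data, for illustration only.
for sql, args in batched_inserts(
    "person", ["id", "name"], [(1, "a"), (2, "b"), (3, "c")]
):
    print(sql, args)
```

Batching keeps the number of round trips down compared with one INSERT per row, which is the main cost of replacing a streamer with plain SQL.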
>>>>>> 
>>>>>> Advantages of using the Java Thin Client in the Spark integration (they
>>>>>> all follow from the known advantages of the Java Thin Client):
>>>>>>       1. Easy to configure: only IP addresses of server nodes are
>>>>>> required.
>>>>>>       2. Easy to deploy: only 1 additional jar required. No server
>>>>>> side (Ignite worker) configuration required.
>>>>>> 
>>>>>> I propose to implement Spark Data Frame integration through Java Thin
>>>>>> Client.
>>>>>> 
>>>>>> Thoughts?
>>>>>> 
>>>>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>>>> [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
>>>>>> 
>> 
>>
