Ignite doesn’t currently support Spark Structured Streaming: https://issues.apache.org/jira/browse/IGNITE-9357
There’s a working patch associated with it.

Regards,
Stephen

> On 22 Oct 2018, at 10:43, Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> Hello, Stephen.
>
> I suggest thin client deployment as a second option, alongside the existing
> integration that uses a client node.
>
>> I’m thinking specifically about better support for Spark Streaming, where
>> the lack of continuous query support in thin clients removes a significant
>> optimisation option.
>
> It's very interesting.
> Can you share your thoughts?
> What can be improved in the Spark integration?
>
> On Mon, 22/10/2018 at 10:22 +0100, Stephen Darlington wrote:
>> Are you suggesting making the Thin Client deployment an option or a
>> replacement for the thick client? If the latter, do we risk making future
>> desirable changes more difficult (or impossible)? I’m thinking specifically
>> about better support for Spark Streaming, where the lack of continuous
>> query support in thin clients removes a significant optimisation option. I’m
>> sure there are other use cases.
>>
>> Regards,
>> Stephen
>>
>>> On 21 Oct 2018, at 09:08, Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>
>>> Valentin.
>>>
>>> It seems you made several assumptions that, from my point of view, are not
>>> always true:
>>>
>>> 1. "We have access to Spark cluster installation to perform deployment
>>> steps" - this is not true in a cloud or enterprise environment.
>>>
>>> 2. "Spark cluster is used only for Ignite integration".
>>> From what I know, the computational resources of a big Spark cluster are
>>> shared by many business divisions.
>>> And it is not convenient to perform deployment steps on such a cluster.
>>>
>>> 3. "When Ignite + Spark are used in real production it's OK to have
>>> reasonable deployment overhead"
>>> What about a developer who wants to play with this integration
>>> and wants to do it quickly, to see how it works on real-life examples?
>>> Can we make their life much easier?
>>>
>>>> First of all, they will exist with the thin client as well.
>>>
>>> Spark has the ability to deploy jars on workers and add them to the
>>> application tasks' classpath.
>>> For 2.6 we must deploy 11 additional jars to start using Ignite.
>>> Please see my example at the bottom of the documentation page [1].
>>>
>>> Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies
>>> of the Ignite integration to you?
>>> And to our users? :)
>>>
>>> Actually, the list of dependencies will change in 2.7: a new version of
>>> jcache and a new version of h2.
>>> So the user has to change it in code or perform additional deployment steps.
>>>
>>> That is overkill to me.
>>>
>>> On the other hand, the thin client requires only 1 jar.
>>> Moreover, the thin client protocol is backward compatible.
>>> So the thin client will keep working correctly when the Ignite cluster is
>>> updated from 2.6 to 2.7.
>>> So, with Spark integration via the thin client, we will be able to update the
>>> Ignite cluster and the Spark integration separately.
>>> For now, we have to do it in one big step.
>>>
>>> What do you think?
>>>
>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>
>>> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
>>>> Guys,
>>>>
>>>> From my experience, Ignite and Spark clusters typically run in the same
>>>> environment, which makes the client node the preferable option, mainly
>>>> because of performance. BTW, I doubt partition-awareness on the thin client
>>>> will help either, because in dataframes we only run SQL queries and I
>>>> believe the thin client will execute them through a proxy anyway. But correct
>>>> me if I’m wrong.
>>>>
>>>> Either way, it sounds like we just have usability issues with the Ignite/Spark
>>>> integration. Why don’t we concentrate on fixing them then? For example, #3
>>>> can be fixed by loading the XML content on the master and then distributing it
>>>> to workers, instead of loading on every worker independently.
>>>> Then there are certain procedures like deploying JARs, etc. First of all,
>>>> they will exist with the thin client as well. Second of all, I’m sure there
>>>> are ways to simplify these procedures and make integration easier. My opinion
>>>> is that working on such improvements is going to add more value than another
>>>> implementation based on the thin client.
>>>>
>>>> -Val
>>>>
>>>> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dma...@apache.org> wrote:
>>>>
>>>>> Hello Nikolay,
>>>>>
>>>>> Your proposal sounds reasonable. However, I would suggest that we wait until
>>>>> partition-awareness is supported for the Java thin client first. With that
>>>>> feature, the client can connect to any node directly, while presently all
>>>>> the communication goes through a proxy (a node the client is connected
>>>>> to).
>>>>> All of that is bad for performance.
>>>>>
>>>>>
>>>>> Vladimir, how hard would it be to support partition-awareness for the Java
>>>>> client? Probably, Nikolay can take over.
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>>
>>>>> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hello, Igniters.
>>>>>>
>>>>>> Currently, the Spark Data Frame integration is implemented via a client node
>>>>>> connection.
>>>>>> Whenever we need to retrieve some data from Ignite into a Spark worker (or
>>>>>> master), we start a client node.
>>>>>>
>>>>>> It has several major disadvantages:
>>>>>>
>>>>>> 1. We have to copy the whole Ignite distribution onto each Spark
>>>>>> worker [1]
>>>>>> 2. We have to copy the whole Ignite distribution onto the Spark master to
>>>>>> get the catalog to work.
>>>>>> 3. We have to keep the same absolute path to the Ignite configuration
>>>>>> file on every worker and provide it during data frame construction [2]
>>>>>> 4. We have to additionally configure the Spark workers' classpath to
>>>>>> include the Ignite libraries.
>>>>>>
>>>>>> For now, almost all the operations we need in the Spark Data Frame
>>>>>> integration are supported by the Java Thin Client:
>>>>>> * obtain the list of caches.
>>>>>> * get cache configuration.
>>>>>> * execute SQL query.
>>>>>> * stream data to the table: not supported by the thin client for
>>>>>> now, but can be implemented using simple SQL INSERT statements.
>>>>>>
>>>>>> Advantages of using the Java Thin Client in the Spark integration (all of
>>>>>> them follow from the known advantages of the Java Thin Client):
>>>>>> 1. Easy to configure: only the IP addresses of server nodes are
>>>>>> required.
>>>>>> 2. Easy to deploy: only 1 additional jar required. No server-side
>>>>>> (Ignite worker) configuration required.
>>>>>>
>>>>>> I propose to implement the Spark Data Frame integration through the Java Thin
>>>>>> Client.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>>>> [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
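
As a rough illustration of the "simple SQL INSERT statements" idea in the original proposal, here is a minimal Python sketch of writing rows one parameterized INSERT at a time. It is not taken from the thread or from any Ignite API: `insert_statement`, `stream_rows`, the `execute` callback, and the `person` table with its columns are all hypothetical names used for illustration. `execute` stands in for whatever "run this SQL with these arguments" call a real client would provide.

```python
from typing import Callable, Iterable, Sequence, Tuple


def insert_statement(table: str, columns: Sequence[str]) -> str:
    """Build a single-row INSERT with one '?' placeholder per column.

    Table and column names are inserted as given; a real integration
    would need to validate or quote them.
    """
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({cols}) VALUES ({marks})"


def stream_rows(
    execute: Callable[[str, Tuple[object, ...]], None],
    table: str,
    columns: Sequence[str],
    rows: Iterable[Sequence[object]],
) -> int:
    """Send each row through `execute(sql, args)` and return the row count.

    `execute` is a hypothetical stand-in for a client's SQL call.
    """
    sql = insert_statement(table, columns)
    sent = 0
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")
        execute(sql, tuple(row))
        sent += 1
    return sent


if __name__ == "__main__":
    # Collect the calls in a list instead of talking to a real cluster.
    calls = []
    count = stream_rows(
        lambda sql, args: calls.append((sql, args)),
        "person",
        ["id", "name"],
        [(1, "Ann"), (2, "Bob")],
    )
    print(count, calls[0])
```

Issuing one statement per row is the simplest form of the idea; whether it performs well enough compared with a dedicated streaming API is exactly the kind of question the thread leaves open.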