Denis,

Agreed. I will do the final review in the next few days and merge the code.

-Val

On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dma...@apache.org> wrote:

> Guys,
>
> Looking at the parallel discussion about the strategy support, I would
> change my initial stance and support the idea of releasing the integration
> in its current state. Is the code ready to be merged into master? Let's
> concentrate on this first and handle the strategy support as a separate
> JIRA task. Agreed?
>
> —
> Denis
>
> > On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
> > valentin.kuliche...@gmail.com> wrote:
> >
> > Nikolay,
> >
> > Let's estimate the strategy implementation work, and then decide whether
> > to merge the code in its current state or not. If anything is unclear,
> > please start a separate discussion.
> >
> > -Val
> >
> > On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <nizhikov....@gmail.com>
> > wrote:
> >
> >> Hello, Val, Denis.
> >>
> >>> Personally, I think that we should release the integration only after
> >>> the strategy is fully supported.
> >>
> >> I see two major reasons to propose merging the DataFrame API
> >> implementation without the custom strategy:
> >>
> >> 1. My PR is already relatively large. From my experience of interacting
> >> with the Ignite community, the bigger a PR becomes, the more committer
> >> time is required to review it.
> >> So I propose to move in smaller but complete steps here.
> >>
> >> 2. It is not clear to me what exactly "custom strategy and optimization"
> >> includes.
> >> It seems additional discussion is required.
> >> I think I can put my thoughts on paper and start a discussion right
> >> after the basic implementation is done.
> >>
> >>> Custom strategy implementation is actually very important for this
> >>> integration.
> >>
> >> Understood and fully agreed.
> >> I'm ready to continue working in that area.
> >>
> >> On 23.11.2017 02:15, Denis Magda wrote:
> >>
> >>> Val, Nikolay,
> >>>
> >>> Personally, I think that we should release the integration only after
> >>> the strategy is fully supported. Without the strategy we don't really
> >>> leverage Ignite's SQL engine, and we introduce redundant data movement
> >>> between Ignite and Spark nodes.
> >>>
> >>> How big is the effort to support the strategy in terms of the amount of
> >>> work left? 40%, 60%, 80%?
> >>>
> >>> —
> >>> Denis
> >>>
> >>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
> >>> valentin.kuliche...@gmail.com> wrote:
> >>>>
> >>>> Nikolay,
> >>>>
> >>>> Custom strategy implementation is actually very important for this
> >>>> integration. Basically, it will allow creating a SQL query for Ignite
> >>>> and executing it directly on the cluster. Your current implementation
> >>>> only adds a new DataSource, which means that Spark will fetch data into
> >>>> its own memory first and then do most of the work (like joins, for
> >>>> example). Does this make sense to you? Can you please take a look at
> >>>> this and provide your thoughts on how much development is implied
> >>>> there?
> >>>>
> >>>> The current code looks good to me, though, and I'm OK if the strategy
> >>>> is implemented as a next step in the scope of a separate ticket. I will
> >>>> do the final review early next week and merge it if everything is OK.
> >>>>
> >>>> -Val
> >>>>
> >>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
> >>>> nizhikov....@gmail.com> wrote:
> >>>>
> >>>>> Hello.
> >>>>>
> >>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >>>>>> Catalog implementations and what is the difference?
> >>>>>
> >>>>> IgniteCatalog has been removed.
> >>>>>
> >>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
> >>>>>> set manually on SQLContext each time it's created... Is there any way
> >>>>>> to automate this and improve usability?
> >>>>>
> >>>>> IgniteStrategy and IgniteOptimization have been removed, as they are
> >>>>> empty now.
> >>>>>
> >>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>> SparkSession.builder()...
> >>>>>
> >>>>> IgniteBuilder has been added.
> >>>>> The syntax looks like:
> >>>>>
> >>>>> ```
> >>>>> val igniteSession = IgniteSparkSession.builder()
> >>>>>    .appName("Spark Ignite catalog example")
> >>>>>    .master("local")
> >>>>>    .config("spark.executor.instances", "2")
> >>>>>    .igniteConfig(CONFIG)
> >>>>>    .getOrCreate()
> >>>>>
> >>>>> igniteSession.catalog.listTables().show()
> >>>>> ```
> >>>>>
> >>>>> Please see the updated PR: https://github.com/apache/ignite/pull/2742
> >>>>>
> >>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov....@gmail.com>:
> >>>>>
> >>>>>> Hello, Valentin.
> >>>>>>
> >>>>>> My answers are below.
> >>>>>> Dmitry, do we need to move the discussion to JIRA?
> >>>>>>
> >>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >>>>>>> codebase?
> >>>>>>
> >>>>>> As I mentioned earlier, to implement and override the Spark Catalog,
> >>>>>> one has to use the internal (private) Spark API.
> >>>>>> So I have to use the package `org.apache.spark.sql.***` to have
> >>>>>> access to private classes and variables.
> >>>>>>
> >>>>>> For example, the SharedState class, which stores a link to the
> >>>>>> ExternalCatalog, is declared as `private[sql] class SharedState`,
> >>>>>> i.e. package-private.
> >>>>>>
> >>>>>>> Can these classes reside under org.apache.ignite.spark instead?
> >>>>>>
> >>>>>> No, as long as we want to have our own implementation of
> >>>>>> ExternalCatalog.
> >>>>>>
> >>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess
> >>>>>>> are some kind of config options. Can you describe the purpose of
> >>>>>>> each of them?
> >>>>>>
> >>>>>> I extended the comments for these options.
> >>>>>> Please see my commit [1] or the PR HEAD.
> >>>>>>
> >>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >>>>>>> Catalog implementations and what is the difference?
> >>>>>>
> >>>>>> Good catch, thank you!
> >>>>>> After additional research, I found that only IgniteExternalCatalog is
> >>>>>> required.
> >>>>>> I will update the PR to remove IgniteCatalog in a few days.
> >>>>>>
> >>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
> >>>>>>> are our plans on implementing them? Also, what exactly is planned in
> >>>>>>> IgniteOptimization and what is its purpose?
> >>>>>>
> >>>>>> Actually, this is a very good question :)
> >>>>>> And I need advice from experienced community members here:
> >>>>>>
> >>>>>> The purpose of `IgniteOptimization` is to modify the query plan
> >>>>>> created by Spark.
> >>>>>> Currently, we have one optimization, described in IGNITE-3084 [2] by
> >>>>>> you, Valentin :) :
> >>>>>>
> >>>>>> "If there are non-Ignite relations in the plan, we should fall back
> >>>>>> to native Spark strategies."
> >>>>>>
> >>>>>> I think we can go a little further and reduce a join of two
> >>>>>> Ignite-backed Data Frames into a single Ignite SQL query. Currently,
> >>>>>> this feature is unimplemented.
> >>>>>>
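> >>>>>> To make the idea concrete, here is a minimal sketch of the shape
> >>>>>> such a rule could take (the rule name and the `isIgniteBacked`
> >>>>>> helper are illustrative, not part of the PR):
> >>>>>>
> >>>>>> ```
> >>>>>> import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
> >>>>>> import org.apache.spark.sql.catalyst.rules.Rule
> >>>>>>
> >>>>>> object IgniteJoinReduction extends Rule[LogicalPlan] {
> >>>>>>   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> >>>>>>     // If both sides of a join are Ignite-backed relations, the
> >>>>>>     // subtree could be replaced with a single relation that executes
> >>>>>>     // the join as one Ignite SQL query.
> >>>>>>     case j @ Join(left, right, _, _)
> >>>>>>         if isIgniteBacked(left) && isIgniteBacked(right) =>
> >>>>>>       j // placeholder: the reduced Ignite relation would go here
> >>>>>>   }
> >>>>>>
> >>>>>>   // Illustrative stub; real detection would inspect the relation's
> >>>>>>   // provider (e.g. IgniteRelationProvider).
> >>>>>>   private def isIgniteBacked(plan: LogicalPlan): Boolean = false
> >>>>>> }
> >>>>>> ```
> >>>>>>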
> >>>>>> *Do we need it now? Or can we postpone it and concentrate on the
> >>>>>> basic Data Frame and Catalog implementation?*
> >>>>>>
> >>>>>> The purpose of `Strategy`, as you correctly mentioned in [2], is to
> >>>>>> transform a LogicalPlan into physical operators.
> >>>>>> I don't have ideas on how to use this capability, so I think we don't
> >>>>>> need IgniteStrategy.
> >>>>>>
> >>>>>> Can you or anyone else suggest some optimization strategy to speed
> >>>>>> up SQL query execution?
> >>>>>>
> >>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to
> >>>>>>> be set manually on SQLContext each time it's created... Is there any
> >>>>>>> way to automate this and improve usability?
> >>>>>>
> >>>>>> These classes are added to `extraOptimizations` when one uses
> >>>>>> IgniteSparkSession.
> >>>>>> As far as I know, there is no way to automatically add these classes
> >>>>>> to a regular SparkSession.
> >>>>>>
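> >>>>>> For reference, a plain SparkSession does expose experimental hooks
> >>>>>> that a user could set by hand for every session they create (a
> >>>>>> sketch; `IgniteJoinReduction` is the illustrative rule from above):
> >>>>>>
> >>>>>> ```
> >>>>>> import org.apache.spark.sql.SparkSession
> >>>>>>
> >>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
> >>>>>>
> >>>>>> // ExperimentalMethods lets extra optimizer rules be injected, but
> >>>>>> // the user must remember to do this on each new session.
> >>>>>> spark.experimental.extraOptimizations ++= Seq(IgniteJoinReduction)
> >>>>>> ```
> >>>>>>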
> >>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
> >>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
> >>>>>>> confusing.
> >>>>>>
> >>>>>> The DataFrame API is a *public* Spark API, so anyone can provide an
> >>>>>> implementation and plug it into Spark. That's why
> >>>>>> IgniteDataFrameExample doesn't need any Ignite-specific session.
> >>>>>>
> >>>>>> The Catalog API is an *internal* Spark API. There is no way to plug a
> >>>>>> custom catalog implementation into Spark [3]. So we have to use
> >>>>>> `IgniteSparkSession`, which extends the regular SparkSession and
> >>>>>> overrides the link to `ExternalCatalog`.
> >>>>>>
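> >>>>>> For illustration of the first point, the public DataSource path
> >>>>>> looks roughly like this on a vanilla SparkSession (the format name
> >>>>>> and option keys are assumptions based on the PR, not final API;
> >>>>>> CONFIG is the Ignite config path used in the builder example above):
> >>>>>>
> >>>>>> ```
> >>>>>> val df = spark.read
> >>>>>>   .format("ignite")          // provider from IgniteRelationProvider
> >>>>>>   .option("config", CONFIG)  // Ignite config file path (assumed key)
> >>>>>>   .option("table", "person") // Ignite SQL table to expose (assumed key)
> >>>>>>   .load()
> >>>>>>
> >>>>>> df.printSchema()
> >>>>>> ```
> >>>>>>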
> >>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
> >>>>>>> really needed? It looks like we can directly provide the
> >>>>>>> configuration file; if IgniteSparkSession really requires
> >>>>>>> IgniteContext, it can create it by itself under the hood.
> >>>>>>
> >>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
> >>>>>> integration for now, so I tried to reuse it here. I like the idea of
> >>>>>> removing the explicit usage of IgniteContext.
> >>>>>> I will implement it in a few days.
> >>>>>>
> >>>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>>> SparkSession.builder()...
> >>>>>>
> >>>>>> Great idea! I will implement such a builder in a few days.
> >>>>>>
> >>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
> >>>>>>> case when we don't have SQL configured on the Ignite side?
> >>>>>>
> >>>>>> Yes, IgniteCacheRelation is a Data Frame implementation for a
> >>>>>> key-value cache.
> >>>>>>
> >>>>>>> I thought we decided not to support this, no? Or is this something
> >>>>>>> else?
> >>>>>>
> >>>>>> My understanding is the following:
> >>>>>>
> >>>>>> 1. We can't support automatic resolution of key-value caches in the
> >>>>>> *ExternalCatalog*, because there is no way to reliably detect the key
> >>>>>> and value classes.
> >>>>>>
> >>>>>> 2. We can support key-value caches in the regular Data Frame
> >>>>>> implementation, because we can require the user to provide the key
> >>>>>> and value classes explicitly.
> >>>>>>
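> >>>>>> A sketch of how that explicit contract might look from the user side
> >>>>>> (the `cache`, `keyClass` and `valueClass` option keys are
> >>>>>> hypothetical, chosen only to illustrate the idea):
> >>>>>>
> >>>>>> ```
> >>>>>> val kvFrame = spark.read
> >>>>>>   .format("ignite")
> >>>>>>   .option("cache", "testCache")               // hypothetical key
> >>>>>>   .option("keyClass", "java.lang.Long")       // hypothetical key
> >>>>>>   .option("valueClass", "org.example.Person") // hypothetical key
> >>>>>>   .load()
> >>>>>> ```
> >>>>>>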
> >>>>>>> 8. Can you clarify the query syntax in
> >>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >>>>>>
> >>>>>> Key-value cache:
> >>>>>>
> >>>>>> key - java.lang.Long,
> >>>>>> value - case class Person(name: String, birthDate: java.util.Date)
> >>>>>>
> >>>>>> The schema of the data frame for the cache is:
> >>>>>>
> >>>>>> key - long
> >>>>>> value.name - string
> >>>>>> value.birthDate - date
> >>>>>>
> >>>>>> So we can select data from the cache:
> >>>>>>
> >>>>>> SELECT
> >>>>>>  key, `value.name`,  `value.birthDate`
> >>>>>> FROM
> >>>>>>  testCache
> >>>>>> WHERE key >= 2 AND `value.name` like '%0'
> >>>>>>
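> >>>>>> In Spark terms, such a query could be issued once the frame is
> >>>>>> registered as a temporary view (a sketch; `df` stands for the
> >>>>>> key-value frame and `spark` for the active session):
> >>>>>>
> >>>>>> ```
> >>>>>> df.createOrReplaceTempView("testCache")
> >>>>>>
> >>>>>> spark.sql(
> >>>>>>   """SELECT key, `value.name`, `value.birthDate`
> >>>>>>     |FROM testCache
> >>>>>>     |WHERE key >= 2 AND `value.name` LIKE '%0'""".stripMargin
> >>>>>> ).show()
> >>>>>> ```
> >>>>>>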
> >>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
> >>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
> >>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733
> >>>>>>
> >>>>>>
> >>>>>> On 18.10.2017 04:39, Dmitriy Setrakyan wrote:
> >>>>>>
> >>>>>>> Val, thanks for the review. Can I ask you to add the same comments
> >>>>>>> to the ticket?
> >>>>>>>
> >>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
> >>>>>>> valentin.kuliche...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Nikolay, Anton,
> >>>>>>>>
> >>>>>>>> I did a high-level review of the code. First of all, impressive
> >>>>>>>> results! However, I have some questions/comments.
> >>>>>>>>
> >>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >>>>>>>> codebase? Can these classes reside under org.apache.ignite.spark
> >>>>>>>> instead?
> >>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess
> >>>>>>>> are some kind of config options. Can you describe the purpose of
> >>>>>>>> each of them?
> >>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >>>>>>>> Catalog implementations and what is the difference?
> >>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
> >>>>>>>> are our plans on implementing them? Also, what exactly is planned
> >>>>>>>> in IgniteOptimization and what is its purpose?
> >>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to
> >>>>>>>> be set manually on SQLContext each time it's created. This seems to
> >>>>>>>> be very error prone. Is there any way to automate this and improve
> >>>>>>>> usability?
> >>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
> >>>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
> >>>>>>>> confusing.
> >>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is
> >>>>>>>> it really needed? It looks like we can directly provide the
> >>>>>>>> configuration file; if IgniteSparkSession really requires
> >>>>>>>> IgniteContext, it can create it by itself under the hood. Actually,
> >>>>>>>> I think it makes sense to create a builder similar to
> >>>>>>>> SparkSession.builder(); it would be good if our APIs here are
> >>>>>>>> consistent with Spark APIs.
> >>>>>>>> 8. Can you clarify the query syntax in
> >>>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
> >>>>>>>> case when we don't have SQL configured on the Ignite side? I
> >>>>>>>> thought we decided not to support this, no? Or is this something
> >>>>>>>> else?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> -Val
> >>>>>>>>
> >>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
> >>>>>>>> avinogra...@gridgain.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Sounds awesome.
> >>>>>>>>>
> >>>>>>>>> I'll try to review the API & tests this week.
> >>>>>>>>>
> >>>>>>>>> Val,
> >>>>>>>>> Your review is still required :)
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <
> >>>>>>>>> nizhikov....@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Yes
> >>>>>>>>>>
> >>>>>>>>>> On Oct 17, 2017, at 2:34 PM, "Anton Vinogradov" <
> >>>>>>>>>> avinogra...@gridgain.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Nikolay,
> >>>>>>>>>>
> >>>>>>>>>> So, it will be able to start regular Spark and Ignite clusters
> >>>>>>>>>> and, using peer class loading via the Spark context, perform any
> >>>>>>>>>> DataFrame request, correct?
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <
> >>>>>>>>>> nizhikov....@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hello, Anton.
> >>>>>>>>>>>
> >>>>>>>>>>> The example you provided is a path to a file that is *local* to
> >>>>>>>>>>> the master.
> >>>>>>>>>>> These libraries are added to the classpath of each remote node
> >>>>>>>>>>> running the submitted job.
> >>>>>>>>>>>
> >>>>>>>>>>> Please see the documentation:
> >>>>>>>>>>>
> >>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
> >>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <
> >>>>>>>>>>> avinogra...@gridgain.com>:
> >>>>>>>>>>>
> >>>>>>>>>>>> Nikolay,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> With Data Frame API implementation there are no requirements
> >>>>>>>>>>>>> to have any Ignite files on Spark worker nodes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you mean? I see code like:
> >>>>>>>>>>>>
> >>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
> >>>>>>>>>>>>   "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
> >>>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
> >>>>>>>>>>>> nizhikov....@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello, guys.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have created an example application to run Ignite Data
> >>>>>>>>>>>>> Frames on a standalone Spark cluster.
> >>>>>>>>>>>>> With the Data Frame API implementation, there is no requirement
> >>>>>>>>>>>>> to have any Ignite files on Spark worker nodes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I ran this application on a free dataset: ATP tennis match
> >>>>>>>>>>>>> statistics.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
> >>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Valentin, have you had a chance to look at my changes?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> >>>>>>>>>>>>>> valentin.kuliche...@gmail.com>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Nikolay,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sorry for the delay on this, I got a little swamped lately.
> >>>>>>>>>>>>>>> I will do my best to review the code this week.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Val
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> >>>>>>>>>>>>>>> nizhikov....@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hello, Valentin.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Now I think I have implemented almost all of the required
> >>>>>>>>>>>>>>>> features.
> >>>>>>>>>>>>>>>> I want to run some performance tests to ensure my
> >>>>>>>>>>>>>>>> implementation works properly with a significant amount of
> >>>>>>>>>>>>>>>> data.
> >>>>>>>>>>>>>>>> And I definitely need some feedback on my changes.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <
> >>>>>>>>>>>>>>>> nizhikov....@gmail.com>:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hello, guys.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>     * Can be run on JDK 7.
> >>>>>>>>>>>>>>>>>     * Still supported: 2.1.2 will be released soon.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2. The latest Spark version is 2.2.0.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>     * Can be run only on JDK 8+.
> >>>>>>>>>>>>>>>>>     * Released Jul 11, 2017.
> >>>>>>>>>>>>>>>>>     * Already supported by huge vendors (Amazon, for
> >>>>>>>>>>>>>>>>>       example).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark
> >>>>>>>>>>>>>>>>> API, so it will take some effort to switch between Spark
> >>>>>>>>>>>>>>>>> 2.1 and 2.2.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> >>>>>>>>>>>>>>>>> valentin.kuliche...@gmail.com>:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I will review in the next few days.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Val
> >>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <
> >>>>>>>>>>>>>>>>>> dma...@apache.org> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hello Nikolay,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is coming to
> >>>>>>>>>>>>>>>>>>> Ignite.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Answering your questions.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance
> >>>>>>>>>>>>>>>>>>> measurements. As a Spark user, I will be curious to know
> >>>>>>>>>>>>>>>>>>> what's the point of this integration. Probably we need
> >>>>>>>>>>>>>>>>>>> to compare the Spark + Ignite and Spark + Hive or Spark
> >>>>>>>>>>>>>>>>>>> + RDBMS cases.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. If the Spark community is reluctant, let's include
> >>>>>>>>>>>>>>>>>>> the module in the ignite-spark integration.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> —
> >>>>>>>>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> >>>>>>>>>>>>>>>>>>> nizhikov....@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hello, guys.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Currently, I'm working on the integration between Spark
> >>>>>>>>>>>>>>>>>>>> and Ignite [1].
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> For now, I have implemented the following:
> >>>>>>>>>>>>>>>>>>>>    * Ignite DataSource implementation
> >>>>>>>>>>>>>>>>>>>>      (IgniteRelationProvider)
> >>>>>>>>>>>>>>>>>>>>    * DataFrame support for Ignite SQL tables.
> >>>>>>>>>>>>>>>>>>>>    * IgniteCatalog implementation for transparent
> >>>>>>>>>>>>>>>>>>>>      resolving of Ignite SQL tables.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The implementation can be found in PR [2].
> >>>>>>>>>>>>>>>>>>>> It would be great if someone provided feedback on the
> >>>>>>>>>>>>>>>>>>>> prototype.
> >>>>>>>>>>>>>>>>>>>> I made some examples in the PR so you can see how the
> >>>>>>>>>>>>>>>>>>>> API is supposed to be used [3], [4].
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I need some advice. Can you help me?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. How should this PR be tested?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what
> >>>>>>>>>>>>>>>>>>>> about scalability tests, etc.?
> >>>>>>>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
> >>>>>>>>>>>>>>>>>>>> What are your thoughts?
> >>>>>>>>>>>>>>>>>>>> Which scenarios should I consider in the first place?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. Should we provide the Spark Catalog implementation
> >>>>>>>>>>>>>>>>>>>> inside the Ignite codebase?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The current implementation of the Spark Catalog is
> >>>>>>>>>>>>>>>>>>>> based on *internal Spark API*.
> >>>>>>>>>>>>>>>>>>>> The Spark community seems not interested in making the
> >>>>>>>>>>>>>>>>>>>> Catalog API public or in including the Ignite Catalog
> >>>>>>>>>>>>>>>>>>>> in the Spark code base [5], [6].
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> *Should we include a Spark internal API implementation
> >>>>>>>>>>>>>>>>>>>> inside the Ignite code base?*
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Or should we consider including the Catalog
> >>>>>>>>>>>>>>>>>>>> implementation in some external module that will be
> >>>>>>>>>>>>>>>>>>>> created and released outside Ignite? (We can still
> >>>>>>>>>>>>>>>>>>>> support and develop it inside the Ignite community.)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
> >>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
> >>>>>>>>>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
> >>>>>>>>>>>>>>>>>>>> [4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
> >>>>>>>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
> >>>>>>>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
> >>>>>>>>>>>>>>>>>>>> nizhikov....@gmail.com
