Nikolay, Val,

Since we agreed to release the feature without the strategy support, can the current integration make it into the 2.4 release? Please chime in on this conversation: http://apache-ignite-developers.2346864.n4.nabble.com/Time-and-scope-for-Apache-Ignite-2-4-td24987.html
—
Denis

> On Nov 28, 2017, at 5:42 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>
> Denis,
>
> Agree. I will do the final review in the next few days and merge the code.
>
> -Val
>
> On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dma...@apache.org> wrote:
>
>> Guys,
>>
>> Looking into the parallel discussion about the strategy support, I would change my initial stance and support the idea of releasing the integration in its current state. Is the code ready to be merged into the master? Let's concentrate on this first and handle the strategy support as a separate JIRA task. Agree?
>>
>> —
>> Denis
>>
>>> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>>>
>>> Nikolay,
>>>
>>> Let's estimate the strategy implementation work, and then decide whether to merge the code in its current state or not. If anything is unclear, please start a separate discussion.
>>>
>>> -Val
>>>
>>> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>
>>>> Hello, Val, Denis.
>>>>
>>>>> Personally, I think that we should release the integration only after the strategy is fully supported.
>>>>
>>>> I see two major reasons to propose merging the DataFrame API implementation without the custom strategy:
>>>>
>>>> 1. My PR is relatively huge already. From my experience of interacting with the Ignite community, the bigger a PR becomes, the more committer time is required to review it. So I propose to move in smaller but complete steps here.
>>>>
>>>> 2. It is not clear to me what exactly "custom strategy and optimization" includes. It seems additional discussion is required. I think I can put my thoughts on paper and start that discussion right after the basic implementation is done.
>>>>
>>>>> Custom strategy implementation is actually very important for this integration.
>>>>
>>>> Understood and fully agreed. I'm ready to continue work in that area.
>>>>
>>>> On 23.11.2017 02:15, Denis Magda wrote:
>>>>
>>>>> Val, Nikolay,
>>>>>
>>>>> Personally, I think that we should release the integration only after the strategy is fully supported. Without the strategy we don't really leverage Ignite's SQL engine, and we introduce redundant data movement between Ignite and Spark nodes.
>>>>>
>>>>> How big is the effort to support the strategy in terms of the amount of work left? 40%, 60%, 80%?
>>>>>
>>>>> —
>>>>> Denis
>>>>>
>>>>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>>>>>>
>>>>>> Nikolay,
>>>>>>
>>>>>> Custom strategy implementation is actually very important for this integration. Basically, it will allow us to create a SQL query for Ignite and execute it directly on the cluster. Your current implementation only adds a new DataSource, which means that Spark will fetch data into its own memory first, and then do most of the work (like joins, for example). Does it make sense to you? Can you please take a look at this and provide your thoughts on how much development is implied there?
>>>>>>
>>>>>> The current code looks good to me, though, and I'm OK if the strategy is implemented as a next step in the scope of a separate ticket. I will do a final review early next week and will merge it if everything is OK.
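>>>>>>
>>>>>> To make the idea concrete, a strategy would plug into Spark's experimental hooks roughly like this (just a sketch; the Ignite-specific helpers are hypothetical and are exactly the work to be estimated):
>>>>>>
>>>>>> ```
>>>>>> import org.apache.spark.sql.Strategy
>>>>>> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>>>>> import org.apache.spark.sql.execution.SparkPlan
>>>>>>
>>>>>> object IgniteStrategy extends Strategy {
>>>>>>   override def apply(plan: LogicalPlan): Seq[SparkPlan] =
>>>>>>     if (isFullyIgniteBacked(plan))
>>>>>>       // Translate the whole subtree into a single Ignite SQL query and
>>>>>>       // execute it on the cluster instead of fetching raw data into Spark.
>>>>>>       Seq(igniteQueryExec(plan))
>>>>>>     else
>>>>>>       Nil // an empty result means: fall back to native Spark strategies
>>>>>>
>>>>>>   // Hypothetical helpers: detecting Ignite-backed plans and building the
>>>>>>   // physical operator that runs the generated SQL are the open questions.
>>>>>>   private def isFullyIgniteBacked(plan: LogicalPlan): Boolean = ???
>>>>>>   private def igniteQueryExec(plan: LogicalPlan): SparkPlan = ???
>>>>>> }
>>>>>>
>>>>>> // Registered per session:
>>>>>> // spark.experimental.extraStrategies ++= Seq(IgniteStrategy)
>>>>>> ```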
>>>>>>
>>>>>> -Val
>>>>>>
>>>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog implementations and what is the difference?
>>>>>>>
>>>>>>> IgniteCatalog is removed.
>>>>>>>
>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set manually on SQLContext each time it's created. ... Is there any way to automate this and improve usability?
>>>>>>>
>>>>>>> IgniteStrategy and IgniteOptimization are removed, as they are empty now.
>>>>>>>
>>>>>>>> Actually, I think it makes sense to create a builder similar to SparkSession.builder()...
>>>>>>>
>>>>>>> IgniteBuilder is added. The syntax looks like:
>>>>>>>
>>>>>>> ```
>>>>>>> val igniteSession = IgniteSparkSession.builder()
>>>>>>>     .appName("Spark Ignite catalog example")
>>>>>>>     .master("local")
>>>>>>>     .config("spark.executor.instances", "2")
>>>>>>>     .igniteConfig(CONFIG)
>>>>>>>     .getOrCreate()
>>>>>>>
>>>>>>> igniteSession.catalog.listTables().show()
>>>>>>> ```
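>>>>>>>
>>>>>>> Since IgniteSparkSession extends the regular SparkSession, Ignite SQL tables then resolve transparently in plain SQL, e.g. (assuming a `person` table exists on the Ignite side):
>>>>>>>
>>>>>>> ```
>>>>>>> igniteSession.sql("SELECT id, name FROM person WHERE id > 10").show()
>>>>>>> ```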
>>>>>>>
>>>>>>> Please see the updated PR - https://github.com/apache/ignite/pull/2742
>>>>>>>
>>>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov....@gmail.com>:
>>>>>>>
>>>>>>>> Hello, Valentin.
>>>>>>>>
>>>>>>>> My answers are below. Dmitry, do we need to move the discussion to Jira?
>>>>>>>>
>>>>>>>>> 1. Why do we have the org.apache.spark.sql.ignite package in our codebase?
>>>>>>>>
>>>>>>>> As I mentioned earlier, to implement and override the Spark Catalog one has to use the internal (private) Spark API. So I have to use the `org.apache.spark.sql.***` package to get access to private classes and variables.
>>>>>>>>
>>>>>>>> For example, the SharedState class that stores the link to the ExternalCatalog is declared as `private[sql] class SharedState`, i.e. package-private.
>>>>>>>>
>>>>>>>>> Can these classes reside under org.apache.ignite.spark instead?
>>>>>>>>
>>>>>>>> No, not as long as we want to have our own implementation of ExternalCatalog.
>>>>>>>>
>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess are some kind of config options. Can you describe the purpose of each of them?
>>>>>>>>
>>>>>>>> I extended the comments for these options. Please see my commit [1] or the PR HEAD.
>>>>>>>>
>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog implementations and what is the difference?
>>>>>>>>
>>>>>>>> Good catch, thank you! After additional research I found that only IgniteExternalCatalog is required. I will update the PR with the IgniteCatalog removal in a few days.
>>>>>>>>
>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our plans on implementing them? Also, what exactly is planned in IgniteOptimization and what is its purpose?
>>>>>>>>
>>>>>>>> Actually, this is a very good question :) And I need advice from experienced community members here.
>>>>>>>>
>>>>>>>> The purpose of `IgniteOptimization` is to modify the query plan created by Spark. Currently, we have one optimization, described in IGNITE-3084 [2] by you, Valentin :) :
>>>>>>>>
>>>>>>>> "If there are non-Ignite relations in the plan, we should fall back to native Spark strategies"
>>>>>>>>
>>>>>>>> I think we can go a little further and reduce a join of two Ignite-backed Data Frames into a single Ignite SQL query. Currently, this feature is unimplemented.
>>>>>>>>
>>>>>>>> *Do we need it now? Or can we postpone it and concentrate on the basic Data Frame and Catalog implementation?*
>>>>>>>>
>>>>>>>> The purpose of `Strategy`, as you correctly mentioned in [2], is to transform a LogicalPlan into physical operators. I don't have ideas on how to use this opportunity, so I think we don't need IgniteStrategy.
>>>>>>>>
>>>>>>>> Can you or anyone else suggest some optimization strategy to speed up SQL query execution?
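>>>>>>>>
>>>>>>>> For illustration, such an optimization would take the shape of a Catalyst rule, roughly like this (a sketch only; the rewrite helpers are hypothetical and are the open design question):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
>>>>>>>> import org.apache.spark.sql.catalyst.rules.Rule
>>>>>>>>
>>>>>>>> object IgniteJoinReduction extends Rule[LogicalPlan] {
>>>>>>>>   override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
>>>>>>>>     // Rewrite a join of two Ignite-backed relations into a single
>>>>>>>>     // relation that executes one SQL JOIN inside Ignite.
>>>>>>>>     case j: Join if bothSidesIgniteBacked(j) => rewriteAsIgniteJoin(j)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Hypothetical helpers; exactly the part that needs design.
>>>>>>>>   private def bothSidesIgniteBacked(j: Join): Boolean = ???
>>>>>>>>   private def rewriteAsIgniteJoin(j: Join): LogicalPlan = ???
>>>>>>>> }
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Plans the partial function does not match are left untouched, which gives the IGNITE-3084 fallback behavior for free.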
>>>>>>>>
>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set manually on SQLContext each time it's created. ... Is there any way to automate this and improve usability?
>>>>>>>>
>>>>>>>> These classes are added to `extraOptimizations` when one uses IgniteSparkSession. As far as I know, there is no way to automatically add these classes to a regular SparkSession.
>>>>>>>>
>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in IgniteCatalogExample but not in IgniteDataFrameExample, which is confusing.
>>>>>>>>
>>>>>>>> The DataFrame API is a *public* Spark API, so anyone can provide an implementation and plug it into Spark. That's why IgniteDataFrameExample doesn't need any Ignite-specific session.
>>>>>>>>
>>>>>>>> The Catalog API is an *internal* Spark API. There is no way to plug a custom catalog implementation into Spark [3], so we have to use `IgniteSparkSession`, which extends the regular SparkSession and overrides the link to the `ExternalCatalog`.
>>>>>>>>
>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it really needed? It looks like we can directly provide the configuration file; if IgniteSparkSession really requires IgniteContext, it can create it by itself under the hood.
>>>>>>>>
>>>>>>>> Actually, IgniteContext is the base class of the Ignite <-> Spark integration for now, so I tried to reuse it here. I like the idea of removing the explicit usage of IgniteContext. I will implement it in a few days.
>>>>>>>>
>>>>>>>>> Actually, I think it makes sense to create a builder similar to SparkSession.builder()...
>>>>>>>>
>>>>>>>> Great idea! I will implement such a builder in a few days.
>>>>>>>>
>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case when we don't have SQL configured on the Ignite side?
>>>>>>>>
>>>>>>>> Yes, IgniteCacheRelation is the Data Frame implementation for a key-value cache.
>>>>>>>>
>>>>>>>>> I thought we decided not to support this, no? Or is this something else?
>>>>>>>>
>>>>>>>> My understanding is the following:
>>>>>>>>
>>>>>>>> 1. We can't support automatic resolving of key-value caches in the *ExternalCatalog*, because there is no way to reliably detect the key and value classes.
>>>>>>>>
>>>>>>>> 2. We can support key-value caches in the regular Data Frame implementation, because there we can require the user to provide the key and value classes explicitly.
>>>>>>>>
>>>>>>>>> 8. Can you clarify the query syntax in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>>>>
>>>>>>>> It is a key-value cache:
>>>>>>>>
>>>>>>>> key - java.lang.Long
>>>>>>>> value - case class Person(name: String, birthDate: java.util.Date)
>>>>>>>>
>>>>>>>> The schema of the data frame for this cache is:
>>>>>>>>
>>>>>>>> key - long
>>>>>>>> value.name - string
>>>>>>>> value.birthDate - date
>>>>>>>>
>>>>>>>> So we can select data from the cache:
>>>>>>>>
>>>>>>>> SELECT
>>>>>>>>     key, `value.name`, `value.birthDate`
>>>>>>>> FROM
>>>>>>>>     testCache
>>>>>>>> WHERE key >= 2 AND `value.name` like '%0'
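>>>>>>>>
>>>>>>>> End to end, the usage would look roughly like this (a sketch; the format short name and option names are placeholders, the real constants live in IgniteRelationProvider):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> val df = spark.read
>>>>>>>>     .format("ignite")                           // placeholder short name
>>>>>>>>     .option("config", CONFIG)                   // Ignite configuration file
>>>>>>>>     .option("cache", "testCache")               // key-value cache to expose
>>>>>>>>     .option("keyClass", "java.lang.Long")       // provided explicitly, see 2. above
>>>>>>>>     .option("valueClass", "org.example.Person") // hypothetical value class
>>>>>>>>     .load()
>>>>>>>>
>>>>>>>> df.createOrReplaceTempView("testCache")
>>>>>>>>
>>>>>>>> spark.sql(
>>>>>>>>     "SELECT key, `value.name`, `value.birthDate` " +
>>>>>>>>     "FROM testCache " +
>>>>>>>>     "WHERE key >= 2 AND `value.name` like '%0'").show()
>>>>>>>> ```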
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
>>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
>>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733
>>>>>>>>
>>>>>>>> On 18.10.2017 04:39, Dmitriy Setrakyan wrote:
>>>>>>>>
>>>>>>>>> Val, thanks for the review. Can I ask you to add the same comments to the ticket?
>>>>>>>>>
>>>>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Nikolay, Anton,
>>>>>>>>>>
>>>>>>>>>> I did a high-level review of the code. First of all, impressive results! However, I have some questions/comments.
>>>>>>>>>>
>>>>>>>>>> 1. Why do we have the org.apache.spark.sql.ignite package in our codebase? Can these classes reside under org.apache.ignite.spark instead?
>>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess are some kind of config options. Can you describe the purpose of each of them?
>>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog implementations and what is the difference?
>>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our plans on implementing them? Also, what exactly is planned in IgniteOptimization and what is its purpose?
>>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set manually on SQLContext each time it's created. This seems to be very error prone. Is there any way to automate this and improve usability?
>>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in IgniteCatalogExample but not in IgniteDataFrameExample, which is confusing.
>>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it really needed? It looks like we can directly provide the configuration file; if IgniteSparkSession really requires IgniteContext, it can create it by itself under the hood. Actually, I think it makes sense to create a builder similar to SparkSession.builder(); it would be good if our APIs here are consistent with Spark APIs.
>>>>>>>>>> 8. Can you clarify the query syntax in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case when we don't have SQL configured on the Ignite side? I thought we decided not to support this, no? Or is this something else?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -Val
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sounds awesome.
>>>>>>>>>>>
>>>>>>>>>>> I'll try to review the API & tests this week.
>>>>>>>>>>>
>>>>>>>>>>> Val, your review is still required :)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 17, 2017, 2:34 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Nikolay,
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, it will be possible to start regular Spark and Ignite clusters and, using peer classloading via the Spark context, perform any DataFrame request, correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello, Anton.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The example you cite passes a path to a master-*local* file. These libraries are added to the classpath of each remote node running the submitted job.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please see the documentation:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
>>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
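>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In short (a commented restatement of the snippet you quote below):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> // The path is resolved on the driver/master machine only; Spark
>>>>>>>>>>>>>> // then ships the jar to every worker and adds it to the classpath
>>>>>>>>>>>>>> // of this job's tasks. Workers need no pre-installed Ignite files.
>>>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
>>>>>>>>>>>>>>     "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
>>>>>>>>>>>>>> ```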
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <avinogra...@gridgain.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nikolay,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With the Data Frame API implementation there are no requirements to have any Ignite files on Spark worker nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you mean? I see code like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have created an example application that runs Ignite Data Frames on a standalone Spark cluster. With the Data Frame API implementation there are no requirements to have any Ignite files on Spark worker nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I ran this application on a free dataset: ATP tennis match statistics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
>>>>>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Valentin, did you have a chance to look at my changes?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Nikolay,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for the delay on this, I got a little swamped lately. I will do my best to review the code this week.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello, Valentin.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Now I think I have implemented almost all required features. I want to run some performance tests to ensure my implementation works properly with a significant amount of data. And I definitely need some feedback on my changes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <nizhikov....@gmail.com>:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>>>>>>>>>>>>>>>>>    * Can be run on JDK 7.
>>>>>>>>>>>>>>>>>>>    * Still supported: 2.1.2 will be released soon.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. The latest Spark version is 2.2.0.
>>>>>>>>>>>>>>>>>>>    * Can be run only on JDK 8+.
>>>>>>>>>>>>>>>>>>>    * Released Jul 11, 2017.
>>>>>>>>>>>>>>>>>>>    * Already supported by huge vendors (Amazon, for example).
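>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In build terms the choice is just which artifact we pin (an sbt sketch for illustration; Ignite itself builds with Maven):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
>>>>>>>>>>>>>>>>>>> // or, JDK 8+ only:
>>>>>>>>>>>>>>>>>>> // libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
>>>>>>>>>>>>>>>>>>> ```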
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark API, so it will take some effort to switch between Spark 2.1 and 2.2.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I will review in the next few days.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is coming to Ignite.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Answering your questions:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance measurements. As a Spark user, I would be curious to know what the point of this integration is. Probably we need to compare the Spark + Ignite case against Spark + Hive or Spark + RDBMS.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. If the Spark community is reluctant, let's include the module in the ignite-spark integration.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Currently, I'm working on the integration between Spark and Ignite [1]. For now, I have implemented the following:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> * Ignite DataSource implementation (IgniteRelationProvider)
>>>>>>>>>>>>>>>>>>>>>> * DataFrame support for Ignite SQL tables.
>>>>>>>>>>>>>>>>>>>>>> * IgniteCatalog implementation for transparent resolving of Ignite SQL tables.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The implementation can be found in PR [2]. It would be great if someone provided feedback on the prototype. I made some examples in the PR so you can see how the API is supposed to be used [3], [4].
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I need some advice. Can you help me?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1. How should this PR be tested?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what about scalability tests, etc.? Maybe we need some Yardstick benchmark or similar? What are your thoughts? Which scenarios should I consider in the first place?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2. Should we provide the Spark Catalog implementation inside the Ignite codebase?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The current implementation of the Spark Catalog is based on the *internal Spark API*. The Spark community seems not interested in making the Catalog API public or in including the Ignite Catalog in the Spark code base [5], [6].
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> *Should we include a Spark internal API implementation inside the Ignite code base?* Or should we consider including the Catalog implementation in some external module that would be created and released outside Ignite? (We could still support and develop it inside the Ignite community.)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
>>>>>>>>>>>>>>>>>>>>>> [4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
>>>>>>>>>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
>>>>>>>>>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>>>>>>>> nizhikov....@gmail.com