[
https://issues.apache.org/jira/browse/IGNITE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352274#comment-16352274
]
ASF GitHub Bot commented on IGNITE-7437:
----------------------------------------
GitHub user dmitrievanthony opened a pull request:
https://github.com/apache/ignite/pull/3472
IGNITE-7437 Fix javadoc in partition based dataset.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gridgain/apache-ignite ignite-7437-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/ignite/pull/3472.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3472
----
commit 9624f00a1f9a33af87050d82b1682bbb7f842d08
Author: dmitrievanthony <dmitrievanthony@...>
Date: 2018-02-05T11:29:47Z
IGNITE-7437 Fix javadoc.
----
> Partition based dataset implementation
> --------------------------------------
>
> Key: IGNITE-7437
> URL: https://issues.apache.org/jira/browse/IGNITE-7437
> Project: Ignite
> Issue Type: New Feature
> Components: ml
> Reporter: Yury Babak
> Assignee: Anton Dmitriev
> Priority: Major
> Fix For: 2.5
>
>
> We want to build our dataset on top of entire partitions instead of key
> sets.
>
> *The main idea behind partition-based datasets is the classic
> [MapReduce|https://en.wikipedia.org/wiki/MapReduce].*
> The most important advantage of MapReduce is the ability to perform
> computations on data distributed across the cluster without transmitting
> significant amounts of data over the network. Partition-based datasets
> adopt this idea in the following way:
> 1. Every dataset consists of partitions.
> 2. Each partition consists of a _context_ built on top of the Apache Ignite
> Cache and _recoverable data_ stored locally on every node.
> 3. Computations on a dataset are split into Map operations, which execute
> on every partition, and Reduce operations, which combine the results of the
> Map operations into one final result.
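
The three points above can be illustrated with a minimal Java sketch. It is not the Apache Ignite API: the class `PartitionMapReduceSketch` and its methods are hypothetical names, and partitions are modeled as plain in-memory arrays rather than cache partitions. It only shows the shape of the computation, where each partition is mapped to a small partial result and the partial results are then reduced.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch, not the Apache Ignite API: partitions are plain arrays here.
final class PartitionMapReduceSketch {
    /** Applies map to every partition independently, then folds the partial results with reduce. */
    static <P, R> R compute(List<P> partitions, Function<P, R> map, BinaryOperator<R> reduce, R identity) {
        R result = identity;
        for (P partition : partitions) {
            // Assumption: in a cluster this call would run on the node that owns the
            // partition, so only the small partial result would cross the network.
            R partial = map.apply(partition);
            result = reduce.apply(result, partial);
        }
        return result;
    }

    /** Example: the mean over all partitions, computed from per-partition (sum, count) pairs. */
    static double mean(List<double[]> partitions) {
        double[] sumAndCount = compute(
            partitions,
            part -> new double[] {Arrays.stream(part).sum(), part.length},
            (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]},
            new double[] {0.0, 0.0});
        return sumAndCount[0] / sumAndCount[1];
    }
}
```

Calling `PartitionMapReduceSketch.mean` on two partitions holding `{1, 2}` and `{3, 4, 5}` maps them to the pairs (3, 2) and (12, 3), and the reduce step combines those into a single mean without ever moving the raw values between partitions.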
> _Why have partitions been selected as the building block of the dataset and
> learning context instead of cluster nodes?_
> One of the fundamental ideas of Apache Ignite Cache is that partitions are
> atomic, which means that they cannot be split between multiple nodes. As a
> result, in case of rebalancing or node failure a partition will be recovered
> on another node with the same data it contained on the previous node.
> This is very important for machine learning because most ML algorithms are
> iterative and require some context to be maintained between iterations.
> This context cannot be split or merged and should be kept in a consistent
> state during the whole learning process.
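
A minimal Java sketch of how a partition might pair a durable context with recoverable local data. Everything here is an assumption made for illustration: the class `RecoverablePartitionSketch`, the `Supplier`-based loader, and `simulateRelocation` are hypothetical and are not part of the Apache Ignite API. The point is only that the context survives a move as one unit while the local data can be rebuilt.

```java
import java.util.function.Supplier;

// Hypothetical sketch, not the Apache Ignite API.
final class RecoverablePartitionSketch<C, D> {
    private final C context;              // assumed durable, e.g. kept in a cache; never split or merged
    private final Supplier<D> dataLoader; // assumed able to rebuild the local data from its source
    private D data;                       // local only; lost when the partition moves
    private int loads;                    // how many times the data had to be (re)built

    RecoverablePartitionSketch(C context, Supplier<D> dataLoader) {
        this.context = context;
        this.dataLoader = dataLoader;
    }

    /** The learning context, which stays intact between iterations. */
    C context() {
        return context;
    }

    /** The local data, rebuilt lazily if it is missing. */
    D data() {
        if (data == null) {
            data = dataLoader.get();
            loads++;
        }
        return data;
    }

    /** Models rebalancing or node failure: local data is dropped, the context is kept. */
    void simulateRelocation() {
        data = null;
    }

    int loadCount() {
        return loads;
    }
}
```

In this sketch an iterative algorithm can update `context()` on every iteration. After `simulateRelocation()`, the next call to `data()` reloads the same data and the context still reflects all earlier iterations.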
> *Another idea behind partition-based datasets is that we need to keep the
> data (in every partition) in a
> [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms]-like
> format as much as possible.*
> [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms] and
> [CUDA|https://en.wikipedia.org/wiki/CUDA] can make machine learning up to
> 100x faster and more reliable than algorithms based on self-written linear
> algebra subroutines, which means that not using BLAS is a recipe for
> disaster. In other words, we need to keep data in a BLAS-like format at any
> price.
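
One possible reading of "BLAS-like format" is a single flat, column-major `double[]` per partition, which is the layout the reference BLAS routines expect. The Java sketch below is an illustration under that assumption. The class `BlasLikeLayoutSketch` is a hypothetical name, and `multiply` is a plain-Java stand-in for a level-2 routine such as `dgemv`, not a call into a BLAS library.

```java
// Hypothetical sketch: plain Java, no BLAS library is called.
final class BlasLikeLayoutSketch {
    /** Packs row-oriented feature vectors into one column-major array: element (i, j) sits at j * rows + i. */
    static double[] toColumnMajor(double[][] rowData) {
        int rows = rowData.length;
        int cols = rows == 0 ? 0 : rowData[0].length;
        double[] flat = new double[rows * cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                flat[j * rows + i] = rowData[i][j];
        return flat;
    }

    /** Computes y = A * x for a column-major A with the given dimensions. */
    static double[] multiply(double[] a, int rows, int cols, double[] x) {
        double[] y = new double[rows];
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                y[i] += a[j * rows + i] * x[j];
        return y;
    }
}
```

With the data already packed this way, the array could be handed to a native routine without a per-call conversion, which is the practical benefit the description argues for.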
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)