[jira] [Commented] (IGNITE-7437) Partition based dataset implementation

Peter Ivanov (JIRA) Mon, 05 Feb 2018 01:52:04 -0800

    [ 
https://issues.apache.org/jira/browse/IGNITE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352197#comment-16352197
 ]


Peter Ivanov commented on IGNITE-7437:
--------------------------------------

Javadoc build broken:
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.7:run 
(javadoc-postprocessing-new) on project apache-ignite: An Ant BuildException 
has occured: Execution failed due to: Class doesn't have description in file: 
/Users/vveider/Development/VCS/Git/ignite/target/javadoc/core/org/apache/ignite/ml/preprocessing/PreprocessingTrainer.html
[ERROR]
[ERROR] around Ant part ...<doctask css="dotted" dir="target/javadoc/core">... 
@ 11:51 in 
/Users/vveider/Development/VCS/Git/ignite/target/antrun/build-main.xml
[ERROR] -> [Help 1]
{code}
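
The message above suggests that the javadoc post-processing step rejects a generated class page with no description, which usually means the type is missing a class-level Javadoc comment. A minimal sketch of the kind of comment the check presumably expects follows. The interface name, type parameters and method are hypothetical and are not the real PreprocessingTrainer signature:

```java
import java.util.List;
import java.util.function.Function;

/**
 * Trainer that fits a preprocessing function on a list of samples.
 * (Hypothetical example type; the point is this class-level description.)
 *
 * @param <T> Type of a raw sample.
 * @param <R> Type of a preprocessed sample.
 */
interface ExamplePreprocessingTrainer<T, R> {
    /**
     * Fits a preprocessing function to the given samples.
     *
     * @param samples Samples to fit on.
     * @return Function that preprocesses a single sample.
     */
    Function<T, R> fit(List<T> samples);
}
```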

> Partition based dataset implementation
> --------------------------------------
>
>                 Key: IGNITE-7437
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7437
>             Project: Ignite
>          Issue Type: New Feature
>          Components: ml
>            Reporter: Yury Babak
>            Assignee: Anton Dmitriev
>            Priority: Major
>             Fix For: 2.5
>
>
> We want to implement our dataset based on entire partitions instead of key 
> sets.
>  
> *The main idea behind the partition based datasets is the classic 
> [MapReduce.|https://en.wikipedia.org/wiki/MapReduce]*
> The most important advantage of MapReduce is the ability to perform 
> computations on data distributed across the cluster without significant 
> data transmission over the network. This idea is adopted in the partition 
> based datasets in the following way:
> 1. Every dataset consists of partitions.
> 2. Each partition consists of a _context_ built on top of the Apache Ignite 
> Cache and _recoverable data_ stored locally on every node.
> 3. Computations on a dataset are split into Map operations, which execute on 
> every partition, and Reduce operations, which reduce the results of Map 
> operations into one final result.
> _Why have partitions been selected as the building block of datasets and 
> learning contexts instead of cluster nodes?_
> One of the fundamental ideas of Apache Ignite Cache is that partitions are 
> atomic, which means that they cannot be split between multiple nodes. As a 
> result, in case of rebalancing or node failure, a partition will be recovered 
> on another node with the same data it contained on the previous node.
> This is very important for machine learning because most ML algorithms are 
> iterative and require some context maintained between iterations. This 
> context cannot be split or merged and should be kept in a consistent state 
> during the whole learning process.
> *Another idea behind the partition based datasets is that we need to have 
> data (in every partition) in a 
> [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms]-like 
> format as much as possible.*
> [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms] and 
> [CUDA|https://en.wikipedia.org/wiki/CUDA] make machine learning 100x faster 
> and more reliable than algorithms based on self-written linear algebra 
> subroutines, which means that not using BLAS is a recipe for disaster. In 
> other words, we need to keep data in a BLAS-like format at any price.
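
The description quoted above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Apache Ignite API: the class PartitionDatasetSketch, its Partition type and the compute, columnSums and add methods are all hypothetical names. Each partition keeps its rows flattened in column-major order (the reference BLAS convention), the Map step runs on one partition's local data, and the Reduce step merges partial results. The per-partition context is omitted for brevity, and the loop stands in for running each Map step on the node that owns the partition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical sketch of a partition based dataset; not the Apache Ignite API. */
final class PartitionDatasetSketch {
    /** One partition holding its local data as a flat, column-major matrix. */
    static final class Partition {
        final int rows;
        final int cols;
        /** Element (r, c) is stored at index c * rows + r. */
        final double[] features;

        Partition(double[][] rowsData) {
            rows = rowsData.length;
            cols = rows == 0 ? 0 : rowsData[0].length;
            features = new double[rows * cols];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    features[c * rows + r] = rowsData[r][c];
        }
    }

    /** Runs map on every partition and folds the partial results with reduce. */
    static <R> R compute(List<Partition> partitions,
                         Function<Partition, R> map,
                         BinaryOperator<R> reduce) {
        R result = null;
        for (Partition p : partitions) {
            R partial = map.apply(p);
            result = result == null ? partial : reduce.apply(result, partial);
        }
        return result;
    }

    /** Map step: per-column sums of one partition's local data. */
    static double[] columnSums(Partition p) {
        double[] sums = new double[p.cols];
        for (int c = 0; c < p.cols; c++)
            for (int r = 0; r < p.rows; r++)
                sums[c] += p.features[c * p.rows + r];
        return sums;
    }

    /** Reduce step: element-wise sum of two partial results. */
    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++)
            out[i] = a[i] + b[i];
        return out;
    }

    public static void main(String[] args) {
        List<Partition> partitions = Arrays.asList(
            new Partition(new double[][] {{1, 2}, {3, 4}}),
            new Partition(new double[][] {{5, 6}}));
        double[] totals = compute(partitions,
            PartitionDatasetSketch::columnSums, PartitionDatasetSketch::add);
        System.out.println(Arrays.toString(totals)); // prints [9.0, 12.0]
    }
}
```

Only the small per-partition results (here one double[] per partition) would need to cross the network in a real cluster. The raw rows stay where the partition lives.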



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
