[ 
https://issues.apache.org/jira/browse/SINGA-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076540#comment-15076540
 ] 

ASF subversion and git services commented on SINGA-97:
------------------------------------------------------

Commit 9fbc8ee7aabbbdc2f76cdcccdf346e14d4544f1a in incubator-singa's branch 
refs/heads/master from [~zhongle]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=9fbc8ee ]

SINGA-97 Add HDFS Store

Modify compilation files. A user can now build SINGA with HDFS support 
by running:
        ./configure --enable-hdfs --with-libhdfs=/PATH/TO/HDFS3
--with-libhdfs is optional; by default the path is /usr/local/.


> SINGA-97 Add HDFS Store 
> ------------------------
>
>                 Key: SINGA-97
>                 URL: https://issues.apache.org/jira/browse/SINGA-97
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Anh Dinh
>            Assignee: Anh Dinh
>
> This ticket implements an HDFS Store for reading data from HDFS. It 
> complements the existing CSV Store, which reads data from CSV files. HDFS is 
> a popular distributed file system with high (sequential) I/O throughput, so 
> supporting it is necessary for SINGA to scale. 
> The implementation will extend singa::io::Store class which is declared in 
> `singa/io/store.h`. In particular, it will support the following I/O 
> operations:
> + `bool Open(string& file, Mode mode)`
> + `bool Close()`
> + `bool Flush()`
> + `int Seek(int record_idx)`
> + `int Read(string *content)`
> + `int Write(string& content)`
> HDFS usage in SINGA differs from that in standard MapReduce applications. 
> Specifically, a SINGA worker may train on sequences of records that do not 
> lie within block boundaries, whereas each MapReduce Mapper processes a 
> number of complete blocks. In MapReduce, the runtime engine may fetch and 
> cache an entire block over the network, knowing that the block will be 
> processed in full. In SINGA, such a pre-fetching and caching strategy would 
> be sub-optimal because it wastes I/O and network bandwidth on data records 
> that are not used. 
> We defer I/O optimization to a future ticket. 
> For the implementation, we choose Pivotal's `libhdfs3` as the C++ HDFS 
> client. This library is implemented natively in C++, hence it is more 
> efficient and easier to deploy than the original `libhdfs` library shipped 
> with Hadoop. Finally, we test the implementation in a distributed 
> environment set up from a number of Docker containers (see SINGA-11). 
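To illustrate the interface described above, here is a minimal, hedged sketch of a `Store`-style class with the six listed operations. The class name `LocalLineStore` and the one-record-per-line format are illustrative assumptions, not the actual SINGA code; a real HDFS-backed store would issue the equivalent calls through libhdfs3's C API (`hdfsOpenFile`, `hdfsRead`, `hdfsSeek`, `hdfsFlush`, `hdfsCloseFile`) instead of `std::fstream`.

```cpp
#include <cassert>
#include <fstream>
#include <string>

enum Mode { kRead, kCreate };

// Abstract interface mirroring the operations listed in the ticket.
class Store {
 public:
  virtual ~Store() {}
  virtual bool Open(const std::string& file, Mode mode) = 0;
  virtual bool Close() = 0;
  virtual bool Flush() = 0;
  virtual int Seek(int record_idx) = 0;
  virtual int Read(std::string* content) = 0;
  virtual int Write(const std::string& content) = 0;
};

// Illustrative stand-in backend: one record per line in a local file.
class LocalLineStore : public Store {
 public:
  bool Open(const std::string& file, Mode mode) override {
    if (mode == kCreate)
      out_.open(file, std::ios::trunc);
    else
      in_.open(file);
    return mode == kCreate ? out_.is_open() : in_.is_open();
  }
  bool Close() override {
    out_.close();
    in_.close();
    return true;
  }
  bool Flush() override {
    out_.flush();
    return out_.good();
  }
  // Reposition to record record_idx by rewinding and skipping records;
  // returns the index on success, -1 past end of file.
  int Seek(int record_idx) override {
    in_.clear();
    in_.seekg(0);
    std::string skip;
    for (int i = 0; i < record_idx; ++i)
      if (!std::getline(in_, skip)) return -1;
    return record_idx;
  }
  // Read the next record; returns its length, or -1 at end of file.
  int Read(std::string* content) override {
    return std::getline(in_, *content) ? static_cast<int>(content->size())
                                       : -1;
  }
  // Append one record; returns its length, or -1 on error.
  int Write(const std::string& content) override {
    out_ << content << '\n';
    return out_.good() ? static_cast<int>(content.size()) : -1;
  }

 private:
  std::ifstream in_;
  std::ofstream out_;
};
```

With `Seek` taking a record index rather than a byte offset, a worker can jump to an arbitrary record without reading the data in between into the training pipeline; this is the operation an HDFS backend would map onto `hdfsSeek` over its record framing.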



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
