[ https://issues.apache.org/jira/browse/SINGA-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076541#comment-15076541 ]
ASF subversion and git services commented on SINGA-97:
------------------------------------------------------
Commit 4cfe81373f25b4e4e6f76daf8983ebac3995388a in incubator-singa's branch
refs/heads/master from WANG Sheng
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=4cfe813 ]
SINGA-97 Add HDFS Store
minor change in makefile to make compile without hdfs correct
> SINGA-97 Add HDFS Store
> ------------------------
>
> Key: SINGA-97
> URL: https://issues.apache.org/jira/browse/SINGA-97
> Project: Singa
> Issue Type: New Feature
> Reporter: Anh Dinh
> Assignee: Anh Dinh
>
> This ticket implements an HDFS Store for reading data from HDFS. It complements
> the existing CSV Store, which reads data from CSV files. HDFS is a popular
> distributed file system with high (sequential) I/O throughput, so
> supporting it is necessary for SINGA to scale.
> The implementation will extend the `singa::io::Store` class, which is declared in
> `singa/io/store.h`. In particular, it will support the following I/O
> operations:
> + `bool Open(string& file, Mode mode)`
> + `bool Close()`
> + `bool Flush()`
> + `int Seek(int record_idx)`
> + `int Read(string *content)`
> + `int Write(string& content)`
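
To make the shape of this interface concrete, here is a minimal sketch of the six operations listed above, together with a toy in-memory implementation. It is not the actual `singa::io::Store` declaration: the `sketch` namespace, the `Mode` values, the `MemoryStore` class, the use of `const` references, and the return-value conventions (byte count on success, -1 on failure or end of data) are all assumptions made for illustration.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Hypothetical open modes; the real enum in singa/io/store.h may differ.
enum Mode { kRead, kCreate };

// Abstract interface mirroring the six operations named in the ticket.
// Assumption: const references are used so that literals can be passed.
class Store {
 public:
  virtual ~Store() {}
  virtual bool Open(const std::string& file, Mode mode) = 0;
  virtual bool Close() = 0;
  virtual bool Flush() = 0;
  virtual int Seek(int record_idx) = 0;
  virtual int Read(std::string* content) = 0;
  virtual int Write(const std::string& content) = 0;
};

// Toy backend that keeps records in memory instead of on HDFS.
class MemoryStore : public Store {
 public:
  bool Open(const std::string& file, Mode mode) override {
    file_ = file;
    mode_ = mode;
    cursor_ = 0;
    open_ = true;
    if (mode == kCreate) records_.clear();  // start a fresh "file"
    return true;
  }
  bool Close() override {
    bool was_open = open_;
    open_ = false;
    return was_open;
  }
  bool Flush() override { return open_; }  // nothing is buffered here
  // Moves the cursor to a record index; returns the index, or -1 if invalid.
  int Seek(int record_idx) override {
    if (!open_ || record_idx < 0 ||
        record_idx > static_cast<int>(records_.size())) {
      return -1;
    }
    cursor_ = record_idx;
    return cursor_;
  }
  // Reads the next record; returns its size in bytes, or -1 at end of data.
  int Read(std::string* content) override {
    if (!open_ || mode_ != kRead ||
        cursor_ >= static_cast<int>(records_.size())) {
      return -1;
    }
    *content = records_[cursor_++];
    return static_cast<int>(content->size());
  }
  // Appends one record; returns its size in bytes, or -1 on error.
  int Write(const std::string& content) override {
    if (!open_ || mode_ != kCreate) return -1;
    records_.push_back(content);
    return static_cast<int>(content.size());
  }

 private:
  std::string file_;
  Mode mode_ = kRead;
  bool open_ = false;
  int cursor_ = 0;
  std::vector<std::string> records_;
};

}  // namespace sketch
```

An HDFS-backed store would keep the same public surface and replace the in-memory vector with calls into an HDFS client library.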
> HDFS usage in SINGA differs from that in standard MapReduce applications.
> Specifically, each SINGA worker may train on sequences of records that do
> not align with block boundaries, whereas in MapReduce each Mapper processes a
> number of complete blocks. In MapReduce, the runtime engine may fetch and
> cache an entire block over the network, knowing that the block will be
> processed in full. In SINGA, such a pre-fetching and caching strategy would be
> sub-optimal because it wastes I/O and network bandwidth on data records that
> are not used.
> We defer I/O optimization to a future ticket.
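
The cost of whole-block fetching for a range that is not block-aligned can be illustrated with some simple arithmetic. The sketch below is only an illustration, not a model of how any HDFS client actually behaves: the `FetchCost` struct and the `WholeBlockFetchCost` function are hypothetical names, and the block size is whatever the caller passes in.

```cpp
#include <cstdint>

// Result of comparing the bytes a worker needs with the bytes that would be
// transferred if every touched block were fetched whole.
struct FetchCost {
  std::int64_t first_block;    // index of the first block touched
  std::int64_t last_block;     // index of the last block touched
  std::int64_t bytes_needed;   // size of the requested byte range
  std::int64_t bytes_fetched;  // size of all touched blocks combined
};

// Computes the cost for the half-open byte range [begin, end).
// Assumes 0 <= begin < end and block_size > 0. For simplicity it ignores
// that the final block of a file may be shorter than block_size.
inline FetchCost WholeBlockFetchCost(std::int64_t begin, std::int64_t end,
                                     std::int64_t block_size) {
  FetchCost cost;
  cost.first_block = begin / block_size;
  cost.last_block = (end - 1) / block_size;
  cost.bytes_needed = end - begin;
  cost.bytes_fetched =
      (cost.last_block - cost.first_block + 1) * block_size;
  return cost;
}
```

For example, with a block size of 100 bytes, the range [90, 130) needs 40 bytes but touches blocks 0 and 1, so fetching both blocks whole would transfer 200 bytes.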
> For the implementation, we choose `libhdfs3` from Pivotal as the HDFS client
> library in C++. This library is built natively for C++, hence it is more optimized
> and easier to deploy than the original `libhdfs` library that is shipped
> with Hadoop. Finally, we test the implementation in a distributed environment
> set up from a number of Docker containers (see SINGA-11).