[
https://issues.apache.org/jira/browse/SINGA-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076540#comment-15076540
]
ASF subversion and git services commented on SINGA-97:
------------------------------------------------------
Commit 9fbc8ee7aabbbdc2f76cdcccdf346e14d4544f1a in incubator-singa's branch
refs/heads/master from [~zhongle]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=9fbc8ee ]
SINGA-97 Add HDFS Store
Modify compilation files. Now a user can build SINGA with HDFS support
by running:
./configure --enable-hdfs --with-libhdfs=/PATH/TO/HDFS3
--with-libhdfs is optional; by default the path is /usr/local.
> SINGA-97 Add HDFS Store
> ------------------------
>
> Key: SINGA-97
> URL: https://issues.apache.org/jira/browse/SINGA-97
> Project: Singa
> Issue Type: New Feature
> Reporter: Anh Dinh
> Assignee: Anh Dinh
>
> This ticket implements an HDFS Store for reading data from HDFS. It complements
> the existing CSV Store, which reads data from CSV files. HDFS is a popular
> distributed file system with high sequential I/O throughput, so supporting
> it is necessary for SINGA to scale.
> The implementation will extend the singa::io::Store class, which is declared
> in `singa/io/store.h`. In particular, it will support the following I/O
> operations:
> + `bool Open(string& file, Mode mode)`
> + `bool Close()`
> + `bool Flush()`
> + `int Seek(int record_idx)`
> + `int Read(string *content)`
> + `int Write(string& content)`
> HDFS usage in SINGA is different from that in standard MapReduce applications.
> Specifically, each SINGA worker may train on sequences of records that do
> not lie within block boundaries, whereas in MapReduce each Mapper processes a
> number of complete blocks. In MapReduce, the runtime engine may fetch and
> cache an entire block over the network, knowing that the block will be
> processed in full. In SINGA, such a pre-fetching and caching strategy would be
> sub-optimal because it wastes I/O and network bandwidth on data records that
> are never used.
> We defer I/O optimization to a future ticket.
> For the implementation, we choose `libhdfs3` from Pivotal as the HDFS client
> library. It is implemented natively in C++, hence it is more efficient and
> easier to deploy than the original `libhdfs` library shipped with Hadoop.
> Finally, we test the implementation in a distributed environment
> set up from a number of Docker containers (see SINGA-11).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)