Anh Dinh created SINGA-97:
-----------------------------

             Summary: SINGA-97 Add HDFS support 
                 Key: SINGA-97
                 URL: https://issues.apache.org/jira/browse/SINGA-97
             Project: Singa
          Issue Type: New Feature
            Reporter: Anh Dinh
            Assignee: Anh Dinh


This ticket implements an HDFS Store for reading data from HDFS. It complements 
the existing CSV Store, which reads data from CSV files. HDFS is a popular 
distributed file system with high (sequential) I/O throughput, so supporting 
it is necessary for SINGA to scale. 

The implementation will extend the `singa::io::Store` class, which is declared 
in `singa/io/store.h`. In particular, it will support the following I/O 
operations:

+ `bool Open(string& file, Mode mode)`
+ `bool Close()`
+ `bool Flush()`

+ `int Seek(int record_idx)`
+ `int Read(string *content)`
+ `int Write(string& content)`
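
As a rough illustration of the interface above, the sketch below declares the six operations on an abstract class and backs them with an in-memory stand-in. It is not the actual contents of `singa/io/store.h`: the `sketch` namespace, the `Mode` values `kRead`/`kWrite`, the `MemStore` class, and the return-value conventions (`-1` on failure, record length in bytes on success) are all assumptions made for the example. An HDFS-backed Store would replace the `std::vector` with calls into the HDFS client.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

namespace sketch {  // hypothetical namespace, not singa::io

// Assumed values; the ticket names `Mode` but does not list its members.
enum Mode { kRead, kWrite };

// Abstract interface with the six operations listed in the ticket.
class Store {
 public:
  virtual ~Store() {}
  virtual bool Open(std::string& file, Mode mode) = 0;
  virtual bool Close() = 0;
  virtual bool Flush() = 0;
  virtual int Seek(int record_idx) = 0;
  virtual int Read(std::string* content) = 0;
  virtual int Write(std::string& content) = 0;
};

// In-memory stand-in: each "file" is a vector of records kept in this object.
class MemStore : public Store {
 public:
  bool Open(std::string& file, Mode mode) override {
    if (open_) return false;                        // already open
    if (mode == kRead && files_.count(file) == 0) return false;
    if (mode == kWrite) files_[file].clear();       // truncate on write
    current_ = &files_[file];
    mode_ = mode;
    pos_ = 0;
    open_ = true;
    return true;
  }

  bool Close() override {
    if (!open_) return false;
    open_ = false;
    current_ = nullptr;
    return true;
  }

  // Nothing is buffered in memory, so Flush only reports the open state.
  bool Flush() override { return open_; }

  // Assumed convention: returns the new record index, or -1 on failure.
  int Seek(int record_idx) override {
    if (!open_ || mode_ != kRead || record_idx < 0) return -1;
    if (static_cast<std::size_t>(record_idx) > current_->size()) return -1;
    pos_ = static_cast<std::size_t>(record_idx);
    return record_idx;
  }

  // Assumed convention: returns the record length, or -1 at end of data.
  int Read(std::string* content) override {
    if (!open_ || mode_ != kRead || content == nullptr) return -1;
    if (pos_ >= current_->size()) return -1;
    *content = (*current_)[pos_++];
    return static_cast<int>(content->size());
  }

  // Assumed convention: returns the number of bytes appended, or -1.
  int Write(std::string& content) override {
    if (!open_ || mode_ != kWrite) return -1;
    current_->push_back(content);
    return static_cast<int>(content.size());
  }

 private:
  std::map<std::string, std::vector<std::string>> files_;
  std::vector<std::string>* current_ = nullptr;
  Mode mode_ = kRead;
  std::size_t pos_ = 0;
  bool open_ = false;
};

}  // namespace sketch
```

A caller would open a file in `kWrite` mode, append records with `Write`, close it, then reopen it in `kRead` mode and use `Seek` to jump to the first record of its partition before calling `Read` in a loop.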

HDFS usage in SINGA differs from that in standard MapReduce applications. 
Specifically, each SINGA worker may train on a sequence of records that does 
not align with block boundaries, whereas in MapReduce each Mapper processes a 
number of complete blocks. In MapReduce, the runtime engine may fetch and cache 
an entire block over the network, knowing that the block will be processed in 
full. In SINGA, such a pre-fetching and caching strategy would be sub-optimal 
because it wastes I/O and network bandwidth on data records that are never used. 
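
To make the cost concrete, the following sketch estimates how many bytes a whole-block fetch would transfer for a worker that only needs a byte range of the file. The function name `WholeBlockCost`, the `PrefetchCost` struct, and the simplification that every touched block is full-sized are assumptions for illustration; this is not code from SINGA or from any HDFS client.

```cpp
#include <cassert>
#include <cstdint>

// Result of fetching every block that overlaps a requested byte range.
struct PrefetchCost {
  std::int64_t first_block;    // index of the first block touched
  std::int64_t last_block;     // index of the last block touched
  std::int64_t bytes_fetched;  // bytes transferred if whole blocks are fetched
  std::int64_t bytes_unused;   // fetched bytes that fall outside the range
};

// Assumes whole-block fetching and that every touched block is full-sized.
inline PrefetchCost WholeBlockCost(std::int64_t offset, std::int64_t length,
                                   std::int64_t block_size) {
  assert(offset >= 0 && length > 0 && block_size > 0);
  PrefetchCost cost;
  cost.first_block = offset / block_size;
  cost.last_block = (offset + length - 1) / block_size;
  cost.bytes_fetched =
      (cost.last_block - cost.first_block + 1) * block_size;
  cost.bytes_unused = cost.bytes_fetched - length;
  return cost;
}
```

For instance, with an illustrative block size of 64 MiB, a worker whose records occupy 10 MiB starting at offset 60 MiB touches blocks 0 and 1, so whole-block fetching would transfer 128 MiB, of which 118 MiB is not used by that worker.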

We defer I/O optimization to a future ticket. 

For the implementation, we choose `libhdfs3` from Pivotal as the C++ HDFS 
client. This library is implemented natively in C++, hence it is more optimized 
and easier to deploy than the original JNI-based `libhdfs` library that is 
shipped with Hadoop. Finally, we test the implementation in a distributed 
environment set up from a number of Docker containers (see SINGA-11). 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)