AbdulRahman AlHamali created MAPREDUCE-6453:
-----------------------------------------------
Summary: Repeatable Input File Format
Key: MAPREDUCE-6453
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6453
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: AbdulRahman AlHamali
Assignee: AbdulRahman AlHamali
Priority: Minor
We are interested in running the training process of deep learning
architectures on Hadoop clusters. We developed an algorithm that can carry out
this training process in a MapReduce fashion. However, one problem remains that
we would like to address.
In deep learning, the training data is usually passed over multiple times (10
or even more). However, we were not able to find a way to go through the input
training file multiple times without reducing after each pass and then running
another map and reduce, and so on. So, to carry out the experiments, we were
forced to physically repeat the files 10 or 20 times. This is clearly not the
best solution: first, the files become much larger, and second, it is not a
neat way to carry out the job.
Thus, we aim to create an interface that input file formats can implement,
giving them the ability to repeat a file n times before eventually reducing.
This would solve the problem and make Hadoop more suitable for training deep
learning algorithms, and for other problems that require going over the data
multiple times before reducing.
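
To make the idea concrete, here is a minimal sketch of the repetition logic in
plain Java. It deliberately does not use Hadoop's InputFormat or RecordReader
classes; the name RepeatingIterator, the Supplier used to "reopen" the input,
and the repeats parameter are all assumptions made for illustration, not part
of any existing Hadoop API. In a real implementation the same logic would
presumably live in a record reader that re-reads its split once a pass is
exhausted.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: replays an input source a fixed number of times
 * without physically duplicating the data. Not part of Hadoop.
 */
final class RepeatingIterator<T> implements Iterator<T> {
    private final Supplier<Iterator<T>> source; // reopens the underlying input
    private final int repeats;                  // total number of passes
    private int passesStarted = 0;
    private Iterator<T> current = null;

    RepeatingIterator(Supplier<Iterator<T>> source, int repeats) {
        if (repeats < 0) {
            throw new IllegalArgumentException("repeats must be >= 0");
        }
        this.source = source;
        this.repeats = repeats;
    }

    @Override
    public boolean hasNext() {
        // When the current pass is exhausted, start the next one,
        // until the requested number of passes has been started.
        while (current == null || !current.hasNext()) {
            if (passesStarted >= repeats) {
                return false;
            }
            current = source.get();
            passesStarted++;
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}

final class RepeatDemo {
    public static void main(String[] args) {
        // Stand-in for a training file with two records (assumed data).
        List<String> records = List.of("a", "b");
        Iterator<String> it = new RepeatingIterator<>(records::iterator, 3);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

With repeats set to 3, the mapper side would see the two records three times
in order, while the stored input stays at its original size.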