Hello all,

Our names are AbdulRahman AlHamali and MHD Shaker Saad.
We've been using Hadoop for a while to run experiments on parallelizing the training of deep learning architectures with MapReduce, and we are, fortunately, getting some very good results. However, one problem has been bugging us for a while. In deep learning, you need to go through the data several times; for a given input training file, you typically need to repeat that file more than 10 times to reach the required accuracy of your deep learning structure. In Hadoop, however, there is no way to make multiple passes over an input file: we either have to chain jobs (map, then reduce, then map, then reduce, and so on), or physically duplicate the file's data (which is what we eventually did), which costs a lot of storage.

It would be great to be able to simply go through the file n times, where n is a parameter we specify for the job. What we're thinking of is an interface that could be implemented alongside the input formats to provide the ability to repeat the file multiple times (a rough sketch is included below). My friend and I are working on this add-on, but we need a mentor to give us some general guidelines on how to approach the specifics of the problem, and to guide us through the formalities of the contribution process.

We believe that adding this feature would not only make Hadoop more suitable for training deep learning architectures, but would also make it suitable for any application that requires going through a file multiple times before reducing. We would be thrilled if any of the experienced contributors could provide us with some mentoring.

Thank you for your efforts.

Regards,
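P.S. To make the idea a bit more concrete, here is a rough, purely hypothetical sketch of what we have in mind. The class name, the "mapreduce.input.repeat.passes" property, and the approach of repeating splits are our own assumptions, not existing Hadoop APIs; the effect is roughly equivalent to physically copying the data n times, but without the extra storage:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * Hypothetical wrapper that returns every input split n times, so a single
 * job makes n passes over the same data without duplicating it on disk.
 */
public class RepeatingTextInputFormat extends TextInputFormat {

    /** Hypothetical job property naming the number of passes over the input. */
    public static final String NUM_PASSES = "mapreduce.input.repeat.passes";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int passes = conf.getInt(NUM_PASSES, 1);

        List<InputSplit> base = super.getSplits(context);
        List<InputSplit> repeated = new ArrayList<>(base.size() * passes);
        // Add every split 'passes' times; each copy becomes its own map task,
        // so the data is read n times but stored only once.
        for (int i = 0; i < passes; i++) {
            repeated.addAll(base);
        }
        return repeated;
    }
}

A job would then set something like:

    conf.setInt(RepeatingTextInputFormat.NUM_PASSES, 10);
    job.setInputFormatClass(RepeatingTextInputFormat.class);

We realize this sketch only covers the new-API TextInputFormat case; generalizing it into an interface that any InputFormat could implement is exactly the part we would like guidance on.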
