Hello all,

Our names are AbdulRahman AlHamali and MHD Shaker Saad.
We've been using Hadoop for a while to run experiments on parallelizing the training of deep learning architectures with MapReduce, and we are, fortunately, getting some very good results. However, one problem has been bugging us for a while. In deep learning, you need to go through the data several times; for a given input training file, you typically need to repeat that file more than 10 times to reach the required accuracy of your deep learning structure. In Hadoop, however, there is no way to make multiple passes over an input file: we either have to chain jobs (map, then reduce, then map, then reduce, and so on), or physically duplicate the file's data (which is what we eventually did), which costs a lot of storage.

It would be great to be able to simply go through the file n times, where n is a parameter we specify for the job. What we're thinking of is an interface that could be implemented alongside the input formats to provide the ability to repeat the file multiple times (a rough sketch is included below). My friend and I are working on this add-on, but we need a mentor to give us some general guidelines on how to approach the specifics of the problem, and to guide us through the formalities of the contribution process.

We believe that adding this feature would not only make Hadoop more suitable for training deep learning architectures, but would also make it suitable for any application that requires going through a file multiple times before reducing. We would be thrilled if any of the experienced contributors could provide us with some mentoring.

Thank you for your efforts.

Regards,
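P.S. To make the idea a bit more concrete, here is a rough, purely hypothetical sketch of what we have in mind. The class name, the "mapreduce.input.repeat.passes" property, and the approach of repeating splits are our own assumptions, not existing Hadoop APIs; the effect is roughly equivalent to physically copying the data n times, but without the extra storage:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * Hypothetical wrapper that returns every input split n times, so a single
 * job makes n passes over the same data without duplicating it on disk.
 */
public class RepeatingTextInputFormat extends TextInputFormat {

    /** Hypothetical job property naming the number of passes over the input. */
    public static final String NUM_PASSES = "mapreduce.input.repeat.passes";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int passes = conf.getInt(NUM_PASSES, 1);

        List<InputSplit> base = super.getSplits(context);
        List<InputSplit> repeated = new ArrayList<>(base.size() * passes);
        // Add every split 'passes' times; each copy becomes its own map task,
        // so the data is read n times but stored only once.
        for (int i = 0; i < passes; i++) {
            repeated.addAll(base);
        }
        return repeated;
    }
}

A job would then set something like:

    conf.setInt(RepeatingTextInputFormat.NUM_PASSES, 10);
    job.setInputFormatClass(RepeatingTextInputFormat.class);

We realize this sketch only covers the new-API TextInputFormat case; generalizing it into an interface that any InputFormat could implement is exactly the part we would like guidance on.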
