The folder contains text files and subfolders that also contain text files. The text is not key/value data, it's just plain text. Something like this: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dumm...
I'm considering three options:

First, use Hadoop Streaming, as proposed by Jeff Wu here: http://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs

Second, use a custom map/reduce job, with the IdentityMapper as the mapper and a custom reducer that creates the zip file. I'm not sure whether the reducer would have any information about the parent folders, so this may need a custom mapper instead. Something similar to https://github.com/flopezluis/testing-hadoop/blob/master/src/pruebas/Reduce.java

Third, create a new HDFS command that zips in Hadoop. I'm not sure whether Hadoop would distribute the execution, though. If it doesn't, it may take a long time and be very CPU-intensive.

Any ideas? Thanks
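
To make the "info about the parent folders" concern concrete, here is a minimal sketch of the zip-writing part only. It assumes a local folder and uses nothing but the JDK (`java.util.zip` and `java.nio.file`), so it is not Hadoop code; the class name `ZipFolderSketch` and the method `zipFolder` are made up for illustration. The point is that each zip entry name is the file's path relative to the root folder, which is the piece of information a reducer or HDFS command would need to receive for every file. On HDFS the file walk would presumably be replaced by something like `FileSystem.listFiles(path, true)`, but I have not tried that.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: local filesystem only, not an HDFS or map/reduce implementation.
public class ZipFolderSketch {

    // Zips every regular file under sourceDir into zipFile. The entry name is the
    // path relative to sourceDir, so the parent folder structure is preserved.
    public static void zipFolder(Path sourceDir, Path zipFile) throws IOException {
        // Collect the files first so the walk is finished before the archive is written.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }

        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use '/' as the separator regardless of platform.
                String entryName = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(entryName));
                Files.copy(file, zip);   // stream the file contents into the current entry
                zip.closeEntry();
            }
        }
    }
}
```

Note that a single `ZipOutputStream` writes its entries one after another, so this part runs in one process. That is why I'm unsure how much of the work in the second and third options could really be distributed.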