The folder contains text files and subfolders that also contain text files.
The text is not key/value data, it's just plain text. Something like this:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dumm...

I'm thinking about 3 options:

First, use Hadoop Streaming, as proposed by Jeff Wu here:
http://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs
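
With streaming, the mapper can be any executable, so I imagine a small
Python script could tag each line with the file it came from. This is only
a sketch under assumptions: as far as I know, streaming exposes job
properties as environment variables with dots replaced by underscores, but
the exact property name (mapreduce_map_input_file, or map_input_file on
older versions) should be checked against the Hadoop version in use. The
function names input_path, tag_lines and run are my own.

```python
import os
import sys


def input_path(env):
    """Return the path of the file this map task is reading, or '' if unknown."""
    # Assumption: the variable name depends on the Hadoop version.
    return env.get("mapreduce_map_input_file") or env.get("map_input_file") or ""


def tag_lines(lines, path):
    """Yield 'path<TAB>line' records, the key/value shape streaming expects."""
    for line in lines:
        yield path + "\t" + line.rstrip("\n")


def run(stdin, stdout, env):
    """Read raw text lines and write them back tagged with their source path."""
    path = input_path(env)
    for record in tag_lines(stdin, path):
        stdout.write(record + "\n")
```

Used as a mapper script, the file would end with a call to
run(sys.stdin, sys.stdout, os.environ).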

Second, use a custom map/reduce job, with the IdentityMapper as the map
step and a custom reducer that creates the zip file. I'm not sure whether
the reducer will have information about the parent folders; that may need
a custom mapper. Something similar to
https://github.com/flopezluis/testing-hadoop/blob/master/src/pruebas/Reduce.java
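
To make the parent-folder question concrete, here is a local sketch (plain
Python, no Hadoop involved) of what I would expect the reducer side to do
if the mapper emitted the source path and byte offset along with each line.
The (path, offset, line) record shape and the name build_zip are
assumptions of mine, not something taken from the linked Reduce.java. The
point is that the zip entry name can carry the folder structure.

```python
import io
import zipfile
from itertools import groupby


def build_zip(records):
    """Build a zip archive in memory from (path, offset, line) records."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort by path, then by offset, so each file's lines come back in order.
        ordered = sorted(records, key=lambda r: (r[0], r[1]))
        for path, group in groupby(ordered, key=lambda r: r[0]):
            # The entry name keeps the folder part, e.g. "docs/sub/b.txt".
            archive.writestr(path, "".join(line for _, _, line in group))
    return buf.getvalue()
```

Note that one archive means all records pass through a single reducer, so
this step itself would not be distributed.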

Third, create a new HDFS command that zips within Hadoop. I'm not sure
whether Hadoop would distribute the execution, though; if it doesn't, it
may take a long time and be very CPU-intensive.

Any ideas?

Thanks
