I wrote a CSV Jute record parser in Python, and thought some people on the list might also be interested.
http://github.com/ptarjan/hadoop_record

You can use it in your streaming jobs with:

    -inputformat SequenceFileAsTextInputFormat -file hadoop_record.mod

Some of its features:

    >>> from hadoop_record import csv
    >>> csv("T")
    True
    >>> csv(";-1234")
    -1234
    >>> csv("1.0E-10")
    1e-10
    >>> csv("s{T,F}")
    [True, False]
    >>> csv("v{T,F}")
    [True, False]
    >>> csv("v{s{T,F}}")
    [[True, False]]
    >>> csv("m{'don't,#73746f70}")
    {LazyString("don't"): LazyString('stop')}
    >>> csv("'\xe2\x98\x83")
    LazyString('\xe2\x98\x83')
    >>> str(csv("'\xe2\x98\x83"))
    '\xe2\x98\x83'
    >>> unicode(csv("'\xe2\x98\x83"))
    u'\u2603'
    >>> csv("'%00%0a%25%2c")
    LazyString('\x00\n%,')

The LazyString class was needed because I was spending most of my CPU just decoding data from the Jute record that I didn't care about. It shouldn't get in your way too much, as long as you cast it to a str first.

Let me know what you think. For bugs: fork, fix, and send me a pull request (or use the issue tracker).

Paul
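The lazy-decoding idea can be sketched roughly like this. This is an illustrative toy, not hadoop_record's actual implementation: the class name matches, but the internals are my own, and the percent-escape rule is inferred from the `csv("'%00%0a%25%2c")` example above. The raw escaped text is kept as-is and only decoded when you actually cast the value to a string, so records you never look at cost nothing to decode.

```python
import re

class LazyString:
    """Toy sketch: holds raw %XX-escaped text, decodes only on demand."""

    def __init__(self, raw):
        self.raw = raw          # undecoded text, e.g. "%00%0a%25%2c"
        self._decoded = None    # cache for the decoded result

    def _decode(self):
        # Replace each %XX escape with the character it encodes.
        return re.sub(r"%([0-9a-fA-F]{2})",
                      lambda m: chr(int(m.group(1), 16)),
                      self.raw)

    def __str__(self):
        if self._decoded is None:   # decode at most once, when first asked
            self._decoded = self._decode()
        return self._decoded

    def __repr__(self):
        return "LazyString(%r)" % str(self)
```

So `str(LazyString("%00%0a%25%2c"))` gives `'\x00\n%,'`, matching the REPL output above, while a `LazyString` that is never cast pays only the cost of storing the raw text.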
