On Friday, 2013-03-22, Josh Wills wrote:
> I'm working on some tools for doing data integration and building machine
> learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about
> what I'm up to here:
>
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
>
> and the code is here: https://github.com/cloudera/ml

Cool thing, thanks for open sourcing it!

[...]

> Q: Why not do this as part of the Crunch or Mahout projects?
> A: Dependency management. Crunch doesn't depend on Mahout, and Mahout
> doesn't depend on Crunch, and I think that for the sanity of the developers
> of both projects, it should stay that way. Dependency management is already
> enough of a nightmare for Hadoop projects that I didn't want to do anything
> to make it worse. I will contribute anything from the toolkit back to
> Crunch that is deemed useful by the community (e.g., the reservoir sampling
> stuff in CRUNCH-178) and doesn't introduce any new dependencies.

This is really sad - but most probably the best decision for now. Do you
happen to know if there is any work planned on the Hadoop side to clean up
this situation?

Regards,
Matthias
