On 25 August 2011 00:09, Ted Dunning <[email protected]> wrote:
> Praneet and I were just talking about a project he is working on to do with
> higher-order learning methods such as boosting and feature sharding. This
> is all pretty much in the context of classification and possibly clustering.
>
> The problems are:
>
> a) mahout doesn't have a general input format for classifiable data (this
> has been discussed recently)
>
> b) hashed vector representations are not suitable for feature sharding since
> individual features may be redundantly represented in many locations.
>
> c) mahout doesn't have a reasonable data structure for general data transfer
> (related to -a-)

Re (c): could Apache Pig's store/load subsystem be useful here? A possible
side benefit is that data on the same Hadoop cluster would be amenable to both
Mahout and Pig-based hackery, analysis, and scripting. The code is also already
in the Apache universe, which reduces friction around licensing, Maven, etc.

http://pig.apache.org/docs/r0.9.0/func.html#load-store-functions
http://pig.apache.org/docs/r0.9.0/func.html#pigdump
http://pig.apache.org/docs/r0.9.0/func.html#pigstorage

cheers,

Dan
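
P.S. To make this a bit more concrete, here is a rough sketch of the kind of record I have in mind. PigStorage reads and writes delimited text, with tab as the default field delimiter, so a classifiable example could be one line per record. The layout below (label first, then name=value pairs) and the helper names to_line and from_line are purely my assumptions for illustration; they are not an existing Mahout or Pig API. Keeping features named rather than hashed may also sidestep (b), though I have not tested that.

```python
# Hypothetical record layout (an assumption, not a Mahout or Pig format):
#   label <TAB> name=value <TAB> name=value ...
# Tab is used because it is PigStorage's default field delimiter.


def to_line(label, features):
    """Serialize a label and a dict of named numeric features to one line."""
    # Sort by feature name so the same record always yields the same line.
    pairs = ["%s=%r" % (name, float(value))
             for name, value in sorted(features.items())]
    return "\t".join([label] + pairs)


def from_line(line):
    """Parse a line produced by to_line back into (label, features)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    features = {}
    for pair in fields[1:]:
        # Split on the first '=' only, so values keep any later '=' intact.
        name, _, value = pair.partition("=")
        features[name] = float(value)
    return label, features


if __name__ == "__main__":
    line = to_line("spam", {"length": 12, "caps_ratio": 0.5})
    print(line)
    print(from_line(line))
```

Feature names containing a tab or '=' would need escaping, which this sketch does not attempt.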
