Hey, I've been looking at consuming ARFF files for random forest classification.
If you look at the partial implementation example page, you are asked to download an ARFF file, edit it to remove the meta-data, and then recreate the same meta-data with command-line arguments to the Describe utility (plus a scan of the data to find the enumerated values). I thought it would be much nicer if we could just read the meta-data from the ARFF file.
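To make the idea concrete, here is a rough sketch of reading the attribute declarations straight out of an ARFF header. It is Python purely for illustration and is not Mahout code; the function name read_arff_header and the (name, kind, values) tuple layout are made up for this example. It assumes the usual ARFF header layout (@relation, @attribute, @data, with % comments) and does not handle nominal values that contain quoted commas.

```python
import re

# Hypothetical helper, not part of Mahout: collects the header of an ARFF file.
_ATTRIBUTE = re.compile(r"@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s+(.+)", re.IGNORECASE)


def read_arff_header(lines):
    """Return (relation, attributes) for the header lines that precede @data.

    attributes is a list of (name, kind, values) tuples, where values is the
    list of enumerated labels for nominal attributes and None otherwise.
    """
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line[len("@relation"):].strip().strip("'\"")
        elif lower.startswith("@attribute"):
            match = _ATTRIBUTE.match(line)
            if match is None:
                raise ValueError("malformed attribute line: " + line)
            name = match.group(1).strip("'\"")
            spec = match.group(2).strip()
            if spec.startswith("{") and spec.endswith("}"):
                # Nominal attribute: the enumerated values are listed inline,
                # so no scan of the data section is needed to discover them.
                values = [v.strip().strip("'\"") for v in spec[1:-1].split(",")]
                attributes.append((name, "nominal", values))
            else:
                attributes.append((name, spec.split()[0].lower(), None))
        elif lower.startswith("@data"):
            break  # header ends here; the rest of the file is instance data
    return relation, attributes


if __name__ == "__main__":
    sample = [
        "% toy example",
        "@relation weather",
        "@attribute outlook {sunny, overcast, rainy}",
        "@attribute temperature numeric",
        "@attribute play {yes, no}",
        "@data",
        "sunny,85,no",
    ]
    print(read_arff_header(sample))
```

The point is only that the enumerated values are already listed in the header, so neither the hand-written descriptor nor the extra scan of the data should be necessary.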
I've been using the ARFF integration, which generates a meta-data file and a sequence file of vectors from an ARFF file. The plan is to then read the meta-data and the sequence file in the RF classifiers.
So here is my question. The random forest classifiers use a binary file format for the meta-data (generated by org.apache.mahout.classifier.df.tools.Describe). The ARFF integration writes the meta-data in a different format. Is there a need to support both formats in the RF classifiers? I was thinking it might be best to modify df.tools.Describe to generate/read the same format as the ARFF integration (org.apache.mahout.utils.vectors.arff.Driver). Does that sound like a reasonable plan?
Marty
