The file format for metadata from the ARFF integration is pretty
ad hoc. I'll use it for now and push it into the RF classifiers.
It might be nice to change the format to a more widely used
serialization format like JSON or Avro. The problem is I don't know
what other parts of the application consume these files...
On 02/28/2013 01:14 PM, Ted Dunning wrote:
Making this consistent would be very helpful.
On Thu, Feb 28, 2013 at 9:33 AM, Marty Kube <[email protected]> wrote:
Hey,
I've been looking at consuming ARFF files for random forest classification.
If you look at the partial implementation example page, you are asked to
download an ARFF file, edit it to remove the metadata, and then
recreate the same metadata with command-line arguments to the Describe
utility (plus a scan of the data to find enumerated values). I thought it
would be much nicer if we could just read the metadata from the ARFF file.
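To make the idea concrete, here is a rough sketch of pulling attribute names and types straight out of an ARFF header. It is plain Python rather than Mahout code, and read_arff_header is a hypothetical helper made up for illustration; it only assumes the standard ARFF keywords (@relation, @attribute, @data) and "%" comment lines, and it ignores details such as date format strings.

```python
import re

# Hypothetical helper for illustration only; not part of Mahout.
# Matches "@attribute <name> <type-or-{nominal values}>", case-insensitively.
ATTRIBUTE = re.compile(
    r"@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s+(.+)", re.IGNORECASE
)


def read_arff_header(text):
    """Return (relation, attributes) parsed from the header of ARFF text.

    Each attribute is a tuple (name, kind, values), where values is a list
    for nominal attributes and None otherwise.
    """
    relation = None
    attributes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@relation"):
            parts = line.split(None, 1)
            if len(parts) > 1:
                relation = parts[1].strip("'\"")
        elif lower.startswith("@attribute"):
            match = ATTRIBUTE.match(line)
            if match is None:
                raise ValueError("malformed attribute line: " + line)
            name = match.group(1).strip("'\"")
            spec = match.group(2).strip()
            if spec.startswith("{") and spec.endswith("}"):
                # Nominal attribute: the enumerated values are listed inline,
                # so no separate scan of the data is needed to find them.
                values = [v.strip().strip("'\"") for v in spec[1:-1].split(",")]
                attributes.append((name, "nominal", values))
            else:
                # Keep only the type keyword (numeric, string, date, ...).
                attributes.append((name, spec.split()[0].lower(), None))
        elif lower.startswith("@data"):
            break  # the header ends where the data section begins
    return relation, attributes
```

Since nominal attributes carry their value lists in the header, a reader along these lines would get the enumerated values without a pass over the data.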
I've been using the ARFF integration, which generates a metadata file and
a sequence file of vectors from an ARFF file. The plan is to then read the
metadata and sequence file in the RF classifiers.
So here is my question. The random forest classifiers use a binary file
format for the metadata (generated by
org.apache.mahout.classifier.df.tools.Describe).
The ARFF integration writes the metadata in a different format. Is
there a need to support both formats in the RF classifiers? I was thinking
it might be best to modify df.tools.Describe to generate/read the same
format as the ARFF integration (org.apache.mahout.utils.vectors.arff.Driver).
Does that sound like a reasonable plan?
Marty