Can we generate the file descriptor as follows:

1. Make a small CSV file with the same format as the actual dataset, say a CSV file with a header and only one record.
2. Use java weka.core.converters.CSVLoader filename.csv > filename.arff to convert the small CSV into an ARFF file, see http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
The only concern here is: is a small CSV file with one record sufficient to generate the ARFF file header, or do we have to use the whole file to avoid losing information?

Xiaobo Gu

On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <[email protected]> wrote:
> But if we use CSV files, how can we generate descriptors for datasets?
>
> Cheers
>
> Xiaobo Gu
>
> On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <[email protected]> wrote:
>> I guess yes, as long as you don't use quotes or double quotes to embed the
>> fields.
>>
>> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <[email protected]> wrote:
>>
>>> So for simple datasets, which only have numeric columns and character
>>> label (without blanks) category columns, can we just use CSV tools to
>>> save them as a standard CSV file without a header?
>>>
>>>
>>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <[email protected]> wrote:
>>> > The current implementation doesn't support the ARFF format out-of-the-box;
>>> > as described in the wiki, you need to remove the header of the file and leave
>>> > only the data. Actually, this implementation is fully compatible with UCI's
>>> > datasets, which are comma-separated text files. You'll also need to call the
>>> > dataset description tool (see the wiki) in order to generate a proper
>>> > description file (it contains the nature of each attribute: Numerical or
>>> > Categorical).
>>> >
>>> > Yes, you can use BuildForest and TestForest to generate and use Random Forest
>>> > models from the command line.
>>> >
>>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <[email protected]> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> The Random Forest partial implementation in
>>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
>>> >> uses the ARFF file format. Is ARFF the only supported file format when
>>> >> using the BuildForest and TestForest programs, and are the BuildForest and
>>> >> TestForest programs the official tools to build Random Forest models
>>> >> from the command line?
>>> >>
>>> >> Regards,
>>> >>
>>> >> Xiaobo Gu
>>> >>
>>> >
>>>
>>
>
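
On the question of whether a one-record sample is enough: a rough way to see what could be lost is to compare what a single record reveals with what the full file reveals. The Python sketch below is not Weka or Mahout code; the data, the function names, and the rule "a column is numeric if every value parses as a float" are assumptions made for illustration. The N / C / L letters are borrowed from the Numerical / Categorical / label notation as I recall it from the Describe tool, so check them against the wiki before relying on them.

```python
import csv
import io


def infer_column_kinds(rows):
    """Return 'N' for columns whose values all parse as float, else 'C'."""
    kinds = []
    for column in zip(*rows):
        try:
            for value in column:
                float(value)
            kinds.append("N")
        except ValueError:
            kinds.append("C")
    return kinds


def categorical_values(rows, kinds):
    """Collect the distinct values seen in each categorical column."""
    return {
        index: sorted({row[index] for row in rows})
        for index, kind in enumerate(kinds)
        if kind == "C"
    }


def descriptor(kinds, label_index=-1):
    """Join the column kinds into one string, marking the label column 'L'."""
    tokens = list(kinds)
    tokens[label_index] = "L"
    return " ".join(tokens)


# Hypothetical headerless data: two numeric fields, one category, one label.
FULL = "5.1,3.5,red,yes\n4.9,3.0,blue,no\n6.2,2.9,green,yes\n"
full_rows = list(csv.reader(io.StringIO(FULL)))
sample_rows = full_rows[:1]  # the "one record" sample

full_kinds = infer_column_kinds(full_rows)
sample_kinds = infer_column_kinds(sample_rows)

print(descriptor(full_kinds))                           # N N C L
print(categorical_values(sample_rows, sample_kinds))    # {2: ['red'], 3: ['yes']}
print(categorical_values(full_rows, full_kinds))
```

In this toy data the column types come out the same from one record as from the whole file, but the single record shows only one value per categorical column. So if the tool that writes the ARFF header or the descriptor enumerates categorical values from the data it reads, a one-record sample would probably miss most of them, and running it on the whole file looks like the safer choice.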
