You can actually specify a directory as the initial input, without any
hassle or changes. If you are on trunk, take a look at the
testKMeansMRJob() method in TestKMeansClustering.java. It creates two
files and stores them in a single directory, which is then passed as
the input:

...
ClusteringTestUtils.writePointsToFile(points, "testdata/points/file" + i, fs, conf);
ClusteringTestUtils.writePointsToFile(points, "testdata/points/file2", fs, conf);
...
KMeansDriver.runJob("testdata/points", "testdata/clusters", "output",
    EuclideanDistanceMeasure.class.getName(), 0.001, 10, 2,
    SparseVector.class);
...

You can tell the driver has picked up all your files because a message such as

 "INFO mapred.FileInputFormat: Total input paths to process : <number-of-input-files>"

will be printed to stdout.
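
To sanity-check the number in that message, here is a minimal sketch that
counts the files a directory input would be expected to expand to. It is a
local-filesystem analogy using java.nio only, not Hadoop's or Mahout's API:
InputDirSketch and listInputFiles are hypothetical names, and the rule that
names starting with "_" or "." are skipped is my assumption about how the
input format filters hidden files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InputDirSketch {

  // Hypothetical helper: returns the files directly inside dir that a job
  // reading the whole directory would be expected to pick up.
  public static List<Path> listInputFiles(Path dir) throws IOException {
    List<Path> inputs = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        String name = p.getFileName().toString();
        // Assumption: names starting with "_" or "." are treated as hidden.
        boolean hidden = name.startsWith("_") || name.startsWith(".");
        // Subdirectories are ignored here; only plain files are counted.
        if (Files.isRegularFile(p) && !hidden) {
          inputs.add(p);
        }
      }
    }
    Collections.sort(inputs); // stable order for printing and comparison
    return inputs;
  }

  public static void main(String[] args) throws IOException {
    // Build a throwaway directory with two point files and one hidden file.
    Path dir = Files.createTempDirectory("points");
    Files.write(dir.resolve("file1"), new byte[] {1});
    Files.write(dir.resolve("file2"), new byte[] {2});
    Files.write(dir.resolve("_logs"), new byte[] {0});
    System.out.println(
        "Total input paths to process : " + listInputFiles(dir).size());
  }
}
```

If the count reported by the job is lower than what you get from a listing
like this, some of your files are probably not where you think they are.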

The same goes if you are running the job from a terminal. Just dfs -put
your files into a directory and run the driver with the -i flag pointing
to that directory.


On Wed, Jul 29, 2009 at 5:37 PM, Adil Aijaz<[email protected]> wrote:
> You need to extend RandomSeedGenerator to take in a directory instead of a
> file. Shouldn't have to make significant changes to KMeansDriver. I have
> made the changes already (plus quite a few other things that I would like to
> contribute) but I am currently stuck in getting clearance from my company's
> Open Source Working Group =(
>
> Adil
>
> Wei Dong wrote:
>>
>> Hi All,
>>
>> I've successfully clustered sequence files with KMeansDriver, but I
>> haven't been able to pass directories of sequence files as input.  I have a
>> huge dataset (~4TB) stored in about 8000 parts and it will cost a lot of
>> space simply to merge them into a single file.  Do I need to implement my
>> own KMeansDriver?
>>
>> Thanks a lot,
>>
>> - Wei Dong
>>
>
>
