You need to extend RandomSeedGenerator to accept a directory instead of
a single file. You shouldn't have to make significant changes to
KMeansDriver. I have already made these changes (plus quite a few other
things that I would like to contribute), but I am currently stuck
waiting for clearance from my company's Open Source Working Group =(
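
A rough sketch of the directory-handling part, not the actual patch:
the helper below expands an input path into the list of part files to
read. It uses java.nio on a local file system as a stand-in for Hadoop's
FileSystem API, and the class and method names (PartFileLister,
listPartFiles) are hypothetical. Skipping names that start with "_" or
"." follows the usual Hadoop convention for side files such as _logs or
.crc checksums.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: resolves an input path to the data files to read.
final class PartFileLister {

    private PartFileLister() {
    }

    // If input is a directory, return its regular data files in sorted
    // order; otherwise return a list containing just the input path.
    static List<Path> listPartFiles(Path input) throws IOException {
        List<Path> parts = new ArrayList<>();
        if (!Files.isDirectory(input)) {
            parts.add(input);
            return parts;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(input)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                // Assumed convention: "_" and "." prefixes mark side files.
                boolean sideFile = name.startsWith("_") || name.startsWith(".");
                if (Files.isRegularFile(p) && !sideFile) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

A seed generator extended along these lines would loop over the returned
paths and open a reader on each one, rather than opening a single file.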
Adil
Wei Dong wrote:
Hi All,
I've successfully clustered sequence files with KMeansDriver, but I
haven't been able to pass a directory of sequence files as input. I
have a huge dataset (~4TB) stored in about 8000 parts, and merging them
into a single file would cost a lot of space. Do I need to implement my
own KMeansDriver?
Thanks a lot,
- Wei Dong