While reading through the wiki and article material on Mahout, I noticed a pre-generation step in which vectors are generated either from text with Lucene or from ARFF with org.apache.mahout.utils.vectors.arff.Driver. Looking at the k-means driver and mapper (KMeansMapper.java), I noticed that the mapper takes a key and a Vector (the point) as input.
Would it be smart or practical to write a special record reader for your own file format that reads the data in as vectors directly and emits them to the mapper, skipping the pre-generation step? I'm just curious about that; maybe I'm missing something, or vectorization would be cumbersome at that point in the job.

Also, in Grant's article on Mahout he includes the vectorized 2.5 GB file from Wikipedia, which is already in the correct format (via Lucene) to work with a Mahout clustering algorithm. Is there a smaller (sub-100 MB) version of this that I could play around with? I'm working with basic building blocks right now and figuring out the facets of vectorization with respect to Mahout, so that we can learn the base case (Lucene vectors) and then move on to our specific case (sensor time series data).

Josh Patterson
TVA
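
P.S. To make the record-reader idea a bit more concrete, here is a rough sketch of the per-line parsing such a reader might do. It is only an illustration under assumptions: the line layout ("<sensorId>,<reading1>,<reading2>,...") and the names SensorLineParser and Record are hypothetical, and it uses plain Java with no Hadoop or Mahout classes. In a real job this logic would presumably sit inside a custom Hadoop RecordReader, and the double[] would be wrapped in whichever Mahout Vector type the mapper expects.

```java
import java.util.Arrays;

/** Hypothetical parser for one line of sensor data: "<sensorId>,<v1>,<v2>,...". */
public class SensorLineParser {

    /** The key/value pair a record reader would hand to the mapper. */
    public static final class Record {
        public final String key;      // sensor id, used as the mapper key
        public final double[] values; // dense point, one entry per reading

        Record(String key, double[] values) {
            this.key = key;
            this.values = values;
        }
    }

    /** Splits a line on commas; the first field is the key, the rest are readings. */
    public static Record parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException(
                "expected an id and at least one reading: " + line);
        }
        double[] values = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            values[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new Record(fields[0].trim(), values);
    }

    public static void main(String[] args) {
        Record r = parse("sensor-7, 1.0, 2.5, 4.0");
        System.out.println(r.key + " -> " + Arrays.toString(r.values));
    }
}
```

If something like this is workable, the vectorization cost moves into the map phase of every run instead of being paid once up front, which may be one reason the pre-generation step exists.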
