Hey,

So, basically I've created a simple Hadoop Pipes job that uses various image
processing libraries (implemented in C/C++) to extract features (e.g. SIFT)
from images that are stored on HDFS in SequenceFiles. The output of this pipe
is pairs of <Text, VectorWritable>, where the Text is the filename of the
image and the VectorWritable contains the extracted feature vector.
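
Just to make the intermediate data concrete, reading those pairs back from HDFS 
looks roughly like this (a minimal sketch against the Hadoop 1.x / Mahout math 
APIs; the path is only a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class DumpFeatures {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // placeholder path: one part file written by the Pipes feature-extraction job
    Path path = new Path("features/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text filename = new Text();
    VectorWritable feature = new VectorWritable();
    while (reader.next(filename, feature)) {
      // key = image filename, value = extracted feature vector (e.g. a SIFT descriptor)
      System.out.println(filename + " -> " + feature.get().size() + " dimensions");
    }
    reader.close();
  }
}
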
After this I run clustering on the extracted features: I generate a codebook
(the cluster centers) and encode the original feature vectors with this
codebook. This step gives me the so-called visual words for the images.
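
The codebook itself is basically a plain Mahout k-means run over those feature 
vectors, roughly like the sketch below (the driver signature changes a bit 
between Mahout releases, and the paths, k and the iteration/convergence 
parameters here are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class BuildCodebook {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path features = new Path("features");        // <Text, VectorWritable> from the Pipes job
    Path seeds    = new Path("codebook-seeds");  // placeholder paths
    Path codebook = new Path("codebook");

    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    // pick k random descriptors as initial cluster centers (k = codebook size)
    Path initial = RandomSeedGenerator.buildRandom(conf, features, seeds, 1000, measure);

    // run k-means; runClustering=true also assigns every feature vector to its
    // nearest center, which is the "visual word" encoding of the images
    KMeansDriver.run(conf, features, initial, codebook, measure,
                     0.001, 20, true, false);
  }
}
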
Once I have the output of the clustering (the visual words), I run the TF-IDF
vectorizer that I've written. TF is the visual-word frequency in an image,
i.e. how many times a given cluster center occurs in that image; IDF is
analogous to the inverse document frequency in text, just computed over
images instead of documents.
This method was first introduced in:
G. Csurka, C. Dance, L.X. Fan, J. Willamowski, and C. Bray, "Visual
categorization with bags of keypoints", Proc. of the ECCV International
Workshop on Statistical Learning in Computer Vision, 2004.
http://www.xrce.xerox.com/content/download/20785/148346/file/2004_010.pdf
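
The vectorizer itself is essentially the textbook TF-IDF formula applied to the 
per-image visual-word histograms. As a small illustration only (this is not the 
actual patch; the df[] counts and the image count are assumed to come from the 
clustering output):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VisualWordTfIdf {
  /**
   * wordCounts[w] = occurrences of visual word w in one image (the TF part),
   * df[w]         = number of images containing visual word w,
   * numImages     = total number of images in the collection.
   */
  public static Vector tfidf(int[] wordCounts, int[] df, int numImages) {
    int codebookSize = wordCounts.length;
    Vector v = new RandomAccessSparseVector(codebookSize);
    for (int w = 0; w < codebookSize; w++) {
      if (wordCounts[w] == 0 || df[w] == 0) {
        continue;
      }
      double tf = wordCounts[w];
      double idf = Math.log((double) numImages / df[w]);
      v.setQuick(w, tf * idf);
    }
    return v;
  }
}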

Ever since that paper, this has basically become a standard method in
computer vision. There are of course various modifications of it (especially
in how the clustering part is done), and the OpenCV library has an
implementation of it as well, for example.

The reason I'm using Mahout/Hadoop is of course that I'm working with huge
datasets.

Once I've got the TF-IDF vectors, I use them for classification, e.g.
classifying natural images by their content with machine learning algorithms.
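
For example, with Mahout's SGD logistic regression the classification step could 
look something like this (any other classifier would do just as well; the 
category count, codebook size and prior below are placeholders):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class ImageCategoryClassifier {
  private final OnlineLogisticRegression model;

  public ImageCategoryClassifier(int numCategories, int codebookSize) {
    // the features are the TF-IDF visual-word vectors from the previous step
    this.model = new OnlineLogisticRegression(numCategories, codebookSize, new L1());
  }

  public void train(int label, Vector tfidfVector) {
    model.train(label, tfidfVector);
  }

  public int predict(Vector tfidfVector) {
    // classifyFull returns one score per category; pick the most likely one
    return model.classifyFull(tfidfVector).maxValueIndex();
  }
}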

Basically, that's it.

cheers,
viktor

On 7/02/2012, at 5:32 PM, Jeff Eastman wrote:

> Sure, love to hear more about your use case and pipeline. Can you describe 
> the steps you are performing and how the results get utilized?
> 
> Jeff
> 
> On 2/7/12 9:28 AM, Viktor Gal wrote:
>> Hi,
>> 
>> I'm using Mahout for computer vision, so my pipeline is a bit different
>> from the text processing pipeline: after I've acquired the feature vectors
>> I do a clustering step, and once I've got the cluster centers and clustered
>> the original feature vectors I do the TF-IDF vector calculation. This is
>> quite a standard approach nowadays in computer vision...
>> 
>> So I've implemented the part that creates TF-IDF vectors from the cluster
>> output, based on the DocumentVectorizer class. If anybody thinks it would
>> be good to have this tool in Mahout, let me know and I'll create a JIRA
>> issue for it and upload my patches there.
>> 
>> cheers,
>> viktor
>> 
> 
