Hello everybody, I'd like to discuss some issues with you regarding the third layer of our proposed tuwoc architecture: the feature extraction from the preprocessed, crawled blog entries.
Currently we use a rather simple process: for each document we compute the TF-IDF of all terms in the corpus. This is implemented straightforwardly as a chain of map/reduce jobs. First, a map job computes (and serializes to HBase) a TF histogram for each document. Then a reduce job computes the IDF of all terms occurring in the corpus and serializes the list of term/IDF pairs to HDFS. Finally, a third map job uses the serialized term/IDF pairs and TF histograms to compute a feature vector for each document. So, essentially, our feature space is the set of all terms in the corpus, each weighted by its IDF.

I currently see one major issue with this approach: our feature space, and thus our feature vectors, will probably become very large when many documents are scanned. This will obviously make the clustering very slow. We will probably have to perform some kind of feature reduction during feature extraction in order to obtain smaller, but still expressive, feature vectors. One idea would be to perform PCA on the "complete" feature vectors in order to identify dimensions that can be pruned (a rough sketch of what I mean is in the P.S. below). However, this might be computationally too expensive. Since I am not very experienced in this field, I was hoping that some of you could share your thoughts or suggestions on the issue.

Cheers,
Max
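
P.S. To make the idea concrete, here is a rough, purely local sketch (plain Python with scikit-learn, not our actual Hadoop jobs; the toy documents and the number of components are just placeholders). Instead of plain PCA it uses truncated SVD, which is essentially PCA without mean-centering and therefore keeps the term matrix sparse:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Toy stand-ins for our preprocessed blog entries.
    docs = [
        "first preprocessed blog entry about clustering",
        "second preprocessed blog entry about hadoop",
        "third preprocessed blog entry about feature extraction",
    ]

    # Step 1: TF-IDF over the whole corpus, one sparse vector per document.
    # This corresponds to our TF/IDF map/reduce jobs; the feature space is
    # the set of all corpus terms.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)    # shape: (n_docs, n_terms)

    # Step 2: project onto a much smaller number of dimensions. For the real
    # corpus this would be a few hundred components; 2 is only for the toy data.
    svd = TruncatedSVD(n_components=2)
    reduced = svd.fit_transform(tfidf)        # shape: (n_docs, 2)

    print(tfidf.shape, "->", reduced.shape)

On our real corpus this reduction step would of course have to run distributed as well, which is exactly the part I am unsure about cost-wise.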
