Dear community, 

I am running a clustering experiment for my project using various metrics to 
cluster fine art auction catalogue entries. 
Using domain knowledge, I have extracted certain features from the text to 
cluster on. Alongside these specialist features, I would like to include some 
form of TF-IDF metric that reflects the semantics and vocabulary of each entry 
relative to the vocabulary used across the other entries, treated as a corpus. 

The steps I have taken so far: 

Text preprocessing and cleaning 
Tokenization
TF-IDF vectorization with scikit-learn
Calculation of a mean TF-IDF score per document (auction entry) from the 
TF-IDF scores of its individual words
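
Concretely, the last two steps look roughly like the sketch below (the example 
entries and vectorizer settings are placeholders rather than my actual 
pipeline; the mean is taken over the terms present in each document, with the 
full-row mean shown as an alternative):

# Rough sketch of the TF-IDF steps above; example entries and vectorizer
# settings are placeholders, not the real pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

entries = [
    "Oil on canvas, signed lower right, housed in a gilt frame",
    "Bronze figure of a dancer, dark brown patina, on a marble base",
    "Watercolour on paper, mounted, framed and glazed",
]

vectorizer = TfidfVectorizer(lowercase=True)  # defaults, including L2 row normalisation
X = vectorizer.fit_transform(entries)         # sparse matrix: documents x vocabulary terms

# Mean TF-IDF per document, averaged over the terms that actually occur in it
# (the non-zero entries of each row).
mean_tfidf = np.asarray(X.sum(axis=1)).ravel() / X.getnnz(axis=1)

# Alternative: average over the full vocabulary row, zeros included.
mean_tfidf_full_row = np.asarray(X.mean(axis=1)).ravel()

for text, score in zip(entries, mean_tfidf):
    print(f"{score:.3f}  {text}")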

My feeling is that this might have some weaknesses (skew from longer documents 
in the corpus, or the impact of high TF-IDF scores from shorter documents), 
but it could capture some of the rarer vocabulary choices in a sentence 
relative to the other entries. 

The idea behind the average TF-IDF score is to get a document-level measure of 
how unusual the vocabulary in a specific document is relative to the entire 
corpus of auction entries. 
I would be interested to hear if anyone has had experience with such a 
methodology or even any feedback on it. 

Does anyone in the group have any experience with, or thoughts on, this as a 
form of sentence- or document-level information capture? 

Word and sentence embeddings are clear alternatives, but I favour TF-IDF as it 
has the advantage of being specific to the vocabulary used in my 
domain-specialist data set. 
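
For completeness, the way I picture the combined input to the clustering step 
is something like the sketch below; the random specialist features, the 
scaling step and the choice of KMeans are stand-ins rather than decisions I 
have made:

# Stand-in sketch: appending the document-level TF-IDF score to the
# specialist features before clustering. The random features, scaler and
# KMeans are placeholders only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_entries = 100
domain_features = rng.random((n_entries, 5))  # stand-in for the hand-crafted features
mean_tfidf = rng.random((n_entries, 1))       # stand-in for the document-level TF-IDF score

features = np.hstack([domain_features, mean_tfidf])
features = StandardScaler().fit_transform(features)  # put all columns on a comparable scale

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))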

Best wishes, 
Mathew