Hi, I am working on applying sparse features to a *phrase-based* system in the *conversational* domain (e.g. SMS, chat).
I used the following sparse features: TargetWordInsertionFeature, SourceWordDeletionFeature, WordTranslationFeature and PhraseLengthFeature. The sparse features are applied only to the top N source and target words (N = 100, 150, 200, 250, ...).

My parallel data: train (201K), tune (6214), test (641).
My system configuration: tuning with MIRA, 5-gram LM with KenLM, everything else at default settings.

Here is the result (SP: sparse features):

                    BLEU     NIST     METEOR
Baseline            0.2009   5.2175   0.2603
Baseline + SP100    0.2021   5.135    0.2645
Baseline + SP150    0.2048   5.1804   0.2653
Baseline + SP200    0.2093   5.2272   0.2671
Baseline + SP250    0.2148   5.2603   0.2680
Baseline + SP300    0.2146   5.2631   0.2680

Although SP250 gives a clear improvement (BLEU 0.2009 to 0.2148), I believe it is due to over-fitting. So I studied the overlap between the train, tune and test sets. The overlap information is as follows:

- *train & test*
  (based on source)
    size of test set = 625 (641 with duplicates)
    size of overlap set = 65
    proportion of train set inside test set = 6394 / 201301
  (based on target)
    size of test set = 621 (641 with duplicates)
    size of overlap set = 69
    proportion of train set inside test set = 13808 / 201301

- *tune & test*
  (based on source)
    size of test set = 625 (641 with duplicates)
    size of overlap set = 624
    proportion of tune set inside test set = 939 / 6214
  (based on target)
    size of test set = 621 (641 with duplicates)
    size of overlap set = 386
    proportion of tune set inside test set = 706 / 6214

(Tune and test overlap heavily on the source side: almost every test source sentence also occurs in tune, but about half of them have different target sentences.)

After filtering the overlapping parts (based on source sentences) out of train and tune with respect to test, my resulting parallel data are: train (194K), tune (5274), test (641). And here is the result:

                    BLEU     NIST     METEOR
Baseline            0.1990   5.1764   0.2589
Baseline + SP250    0.1967   5.0109   0.2606

Only METEOR improved slightly; BLEU and NIST both dropped below the baseline.
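For reference, the overlap counts and the source-based filtering described above can be sketched roughly as follows. This is only a minimal Python sketch and not part of Moses: the function names `overlap_stats` and `filter_by_test_source` are hypothetical, the data is a toy example, and it assumes exact string matching after stripping whitespace, which may differ from how the numbers above were actually computed.

```python
def overlap_stats(test_sents, other_sents):
    """Compare a test set against another set (train or tune).

    Returns a tuple:
      (number of unique test sentences,
       number of unique sentences shared by both sets,
       number of lines in the other set that also occur in the test set)
    Matching is exact string equality after stripping whitespace (an assumption).
    """
    test_set = {s.strip() for s in test_sents}
    other_norm = [s.strip() for s in other_sents]
    overlap = test_set & set(other_norm)
    lines_inside = sum(1 for s in other_norm if s in test_set)
    return len(test_set), len(overlap), lines_inside


def filter_by_test_source(pairs, test_sources):
    """Drop every (source, target) pair whose source sentence occurs in the test set."""
    test_set = {s.strip() for s in test_sources}
    return [(src, tgt) for src, tgt in pairs if src.strip() not in test_set]


# Toy data standing in for the real tune and test sets.
test_sources = ["hi", "how r u", "hi"]
tune_pairs = [
    ("hi", "salut"),
    ("ok", "d'accord"),
    ("how r u", "ca va"),
    ("hi", "bonjour"),
]

stats = overlap_stats(test_sources, [src for src, _ in tune_pairs])
filtered_tune = filter_by_test_source(tune_pairs, test_sources)
```

With the real data, the sentence lists would be read line by line from the source sides of the train, tune and test files, and the same filter would be applied to both train and tune.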
Is there any way to prevent over-fitting when applying sparse features? Or, in this case, is it simply that sparse features do not generalize well to "unseen" data? I would appreciate your advice. Thanks so much!

--
Cheers,
Vu
