Hi,

I am working on applying sparse features to a *phrase-based* system in the
*conversational* domain (e.g. SMS, chat).

I used the following sparse features: TargetWordInsertionFeature,
SourceWordDeletionFeature, WordTranslationFeature, PhraseLengthFeature.
The sparse features are applied only to the top source and target words
(top 100, 150, 200, 250, ...).
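
As an illustration of how such a word list could be built, here is a minimal Python sketch. It assumes that "top" means "most frequent in the training data" and that sentences are whitespace-tokenized; the function name `top_words` is hypothetical and not part of Moses.

```python
from collections import Counter

def top_words(sentences, n):
    """Return the n most frequent whitespace-separated tokens.

    Assumption: "top words" means most frequent words in the given corpus side.
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return [word for word, _ in counts.most_common(n)]

# Toy example: one list per corpus side, one list per cutoff (100, 150, ...).
source_side = ["how r u", "r u ok", "ok thx"]
print(top_words(source_side, 2))
```

One list would be computed for the source side and one for the target side, for each cutoff being tested.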

My parallel data: train(201K); tune(6214); test(641).
My system configuration: tuning with MIRA, a 5-gram LM built with KenLM,
and default settings otherwise.

Here are the results:

                   BLEU     NIST     METEOR
Baseline           0.2009   5.2175   0.2603
Baseline + SP100   0.2021   5.135    0.2645
Baseline + SP150   0.2048   5.1804   0.2653
Baseline + SP200   0.2093   5.2272   0.2671
Baseline + SP250   0.2148   5.2603   0.2680
Baseline + SP300   0.2146   5.2631   0.2680
(SP: sparse features; the number is how many top words are used)

Although SP250 gave a clearly better result than the baseline (0.2148 vs.
0.2009 BLEU), I suspect the gain is due to over-fitting.
I therefore examined the overlap between the train, tune and test sets.
The overlap statistics are as follows:
- *train & test*:
*(based on source)*
size of test set = 625 ( 641 with duplicates )
size of overlap set = 65
proportion of train set inside test set = 6394 / 201301
*(based on target)*
size of test set = 621 ( 641 with duplicates )
size of overlap set = 69
proportion of train set inside test set = 13808 / 201301

- *tune & test*:
*(based on source)*
size of test set = 625 ( 641 with duplicates )
size of overlap set = 624
proportion of tune set inside test set = 939 / 6214
*(based on target)*
size of test set = 621 ( 641 with duplicates )
size of overlap set = 386
proportion of tune set inside test set = 706 / 6214

(The tune and test sets overlap heavily on the source side, but roughly
half of the overlapping sentences have different target sentences.)
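
For reference, statistics of this kind can be computed roughly as in the Python sketch below. It is only an illustration under my reading of the numbers above: "size of test set" is the number of distinct test sentences, "overlap set" is the distinct test sentences that also occur in the other corpus, and the "proportion" counts lines of the other corpus (duplicates included) that occur in the test set. The function name `overlap_stats` is hypothetical.

```python
def overlap_stats(other_sents, test_sents):
    """Compare one side (source or target) of a corpus with the test set."""
    test_unique = {s.strip() for s in test_sents}
    other_unique = {s.strip() for s in other_sents}
    # Distinct test sentences that also appear in the other corpus.
    overlap = test_unique & other_unique
    # Lines of the other corpus (duplicates included) that occur in test.
    hits = sum(1 for s in other_sents if s.strip() in test_unique)
    return {
        "test_unique": len(test_unique),
        "test_total": len(test_sents),
        "overlap": len(overlap),
        "other_in_test": hits,
        "other_total": len(other_sents),
    }

# Toy example with a tiny "train" and "test" source side.
train_src = ["a b", "a b", "c d", "e f"]
test_src = ["a b", "x y", "x y"]
print(overlap_stats(train_src, test_src))
```

The comparison here is exact string match after stripping surrounding whitespace; lowercasing or other normalization would change the counts.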

After removing from train and tune every sentence pair whose source
sentence also appears in the test set, my parallel data are: train(194K);
tune(5274); test(641).
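
The filtering step amounts to something like the following Python sketch. It assumes exact matching on the source sentence after stripping surrounding whitespace; the function name `filter_by_test_source` is hypothetical.

```python
def filter_by_test_source(pairs, test_sources):
    """Drop (source, target) pairs whose source sentence occurs in the test set."""
    blocked = {s.strip() for s in test_sources}
    return [(src, tgt) for src, tgt in pairs if src.strip() not in blocked]

# Toy example: both pairs with source "hi" are removed.
tune_pairs = [("hi", "salut"), ("ok", "d'accord"), ("hi", "bonjour")]
test_sources = ["hi", "thx"]
print(filter_by_test_source(tune_pairs, test_sources))
```

The same function is applied once to the train pairs and once to the tune pairs.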

And here are the results after filtering:

                   BLEU     NIST     METEOR
Baseline           0.1990   5.1764   0.2589
Baseline + SP250   0.1967   5.0109   0.2606

Only METEOR improved slightly; BLEU and NIST both dropped.

Is there any way to prevent over-fitting when applying sparse features?
Or, in this case, will sparse features simply not generalize well to
"unseen" data?
I would appreciate your advice.

Thanks so much!

--
Cheers,
Vu
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support