Hi Mark,
the JRC-Acquis is extracted from legal documents, and in some cases its
sentences can look strange compared to the sentences you find in Europarl.
They can be very short, or contain only numbers or references to laws.
My suggestion is to create another test set and see how your algorithm
performs on the new test set.
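
As a rough sketch of how such a test set could be filtered: the snippet below
drops segment pairs that are very short or mostly numeric. The thresholds and
the names is_noisy and filter_pairs are assumptions made for illustration;
they are not part of Moses or of the JRC-Acquis tools.

```python
def is_noisy(segment, min_tokens=4, min_alpha_ratio=0.5):
    """Heuristic check (assumed thresholds): True for very short segments
    or segments where fewer than half the tokens contain a letter."""
    tokens = segment.split()
    if len(tokens) < min_tokens:
        return True
    # Count tokens that contain at least one alphabetic character.
    with_letters = sum(1 for tok in tokens if any(ch.isalpha() for ch in tok))
    return with_letters / len(tokens) < min_alpha_ratio


def filter_pairs(pairs, **thresholds):
    """Keep only (source, target) pairs where neither side looks noisy."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not is_noisy(src, **thresholds) and not is_noisy(tgt, **thresholds)
    ]


if __name__ == "__main__":
    sample = [
        ("Article 1", "Artikkel 1"),
        ("The Member States shall take the necessary measures .",
         "Liikmesriigid võtavad vajalikud meetmed ."),
        ("1 . 2 . 3 . 4 .", "1 . 2 . 3 . 4 ."),
    ]
    for src, tgt in filter_pairs(sample):
        print(src, "|||", tgt)
```

Comparing scores on the filtered and unfiltered test sets should show whether
the odd segments are what is driving the unexpected results.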

Cheers
Marco

On Wed, Jun 9, 2010 at 2:00 PM, Mark Fishel <[email protected]> wrote:

> > Just in case, let me tell you that there seem to be several corpora (and
> > acquis) published by the JRC. The one I was referring to in my previous
> > message can be downloaded here:
> > http://langtech.jrc.it/DGT-TM.html#Download .
>
> The corpus that I meant is the JRC-Acquis Multilingual Parallel Corpus
> (http://langtech.jrc.it/JRC-Acquis.html).
>
> I wasn't talking so much about technical difficulties or corpus text
> bugs, and I'm aware of the Koehn/Birch/Steinberger paper "462 Machine
> Translation Systems for Europe" -- but rather, has anyone had
> unexpected conclusions on this corpus, for instance something (like a
> method of improving the SMT output) that worked on, say, Europarl and
> didn't work on the same (or other) language pairs on the JRC-Acquis
> parallel corpus?
>
> Thanks in advance,
> Mark & Heiki
>
>
>
> >> Dear readers,
> >>
> >> we keep getting strange, unexpected and sometimes illogical results in
> >> more than one series of SMT experiments using the JRC Acquis parallel
> >> corpus. Often the same methods work fine on Europarl. Our question is
> >
> > Hi Mark,
> >
> > We have been using the JRC acquis corpus *extensively* and I can assure you
> > that we had no big problems. Some colleagues, who have used the program
> > that comes with the corpus, did have some slight problems. I have chosen to
> > unzip the several volumes manually and never had them. For this as well as
> > for other corpora, some characters can derail the training. We have
> > developed Moses for Mere Mortals
> > (http://code.google.com/p/moses-for-mere-mortals/), which provides a
> > Windows add-in (Extract_TMX_Corpus) that helps to clean such things and
> > creates corpora that you can directly feed to Moses (UTF-8, Linux newlines,
> > removal of control characters and so on). Therefore, I can assure you that
> > the JRC acquis definitely works. It seems to me that the Moses team has
> > already published data about their experiments with this corpus. It covers
> > most, if not all, of the language pairs of the European Union, which is a
> > plus.
> >
> > Greetings,
> >
> > João
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
>
