Re: [Moses-support] News monolingual corpus question

Vincent Nguyen Wed, 05 Oct 2016 02:01:35 -0700

Thanks Barry,

Actually, I was trying to 1) replicate the 1 billion word benchmark 
language model and 2) update those results with more recent data.


So technically this is not going to be very easy with the most recent 
version of the data but, as you say, the WMT11 files were not de-duplicated.

Anyway, I'll figure something out. I asked for clarification because my 
word counts were way off.
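
To check how much of the gap de-duplication explains, one could compare 
line and word counts before and after removing repeated lines. The sketch 
below is only an illustration: the thread does not say how the WMT files 
were de-duplicated, so exact-line matching is an assumption, and the names 
dedup_stats and dedup_stats_for_file are hypothetical helpers.

```python
import gzip
import hashlib
from typing import Dict, Iterable


def dedup_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines and whitespace-separated words before and after
    exact-line de-duplication (an assumed notion of 'duplicate')."""
    seen = set()
    stats = {"lines": 0, "words": 0, "unique_lines": 0, "unique_words": 0}
    for line in lines:
        text = line.rstrip("\n")
        n_words = len(text.split())
        stats["lines"] += 1
        stats["words"] += n_words
        # Keep a fixed-size digest instead of the sentence to limit memory use.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            stats["unique_lines"] += 1
            stats["unique_words"] += n_words
    return stats


def dedup_stats_for_file(path: str) -> Dict[str, int]:
    """Read a plain or gzip-compressed UTF-8 corpus file, one sentence per line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return dedup_stats(handle)
```

Running dedup_stats_for_file on the same year's file from each release 
(for example news.2010.en.shuffled) would show whether the unique counts 
of the older file come close to the totals of the newer one.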

Thanks.


On 05/10/2016 at 10:46, Barry Haddow wrote:
> Hi Vincent
>
> I think at some point we re-extracted all previous years. One possible 
> reason for the difference is that now we are de-duping, and before we 
> didn't.
>
> I would say if you want to compare to recent WMT experiments, take the 
> most recent version of the data.
>
> cheers - Barry
>
> On 04/10/16 21:34, Vincent Nguyen wrote:
>>
>> OK.
>> This one: http://www.statmt.org/wmt11/training-monolingual.tgz
>> includes (I think)
>> http://www.statmt.org/wmt11/training-monolingual-news-2010.tgz
>> but if I extract news.2010.en.shuffled, it is 2051344 KB uncompressed
>> (all of the above is from the WMT11 page).
>>
>> From this link:
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.en.shuffled.gz
>> (from the WMT15 page)
>> the uncompressed file is 807761 KB.
>>
>> 2010 is just an example; the sizes differ for every year.
>>
>>
>> On 04/10/2016 at 22:24, Barry Haddow wrote:
>>> Hi Vincent
>>>
>>> Could you say exactly which files you are comparing?
>>>
>>> cheers - Barry
>>>
>>> On 04/10/16 21:20, Vincent Nguyen wrote:
>>>>
>>>> No, but my mistake: I was comparing against the per-year files from
>>>> this link: http://www.statmt.org/wmt15/translation-task.html
>>>>
>>>> What is the difference from the WMT11 files?
>>>>
>>>>
>>>>
>>>> On 04/10/2016 at 21:46, Barry Haddow wrote:
>>>>> Hi Vincent
>>>>>
>>>>> Are you comparing compressed with uncompressed files?
>>>>>
>>>>> cheers - Barry
>>>>>
>>>>> On 04/10/16 14:40, Vincent Nguyen wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On this page:
>>>>>>
>>>>>> http://www.statmt.org/wmt11/translation-task.html
>>>>>>
>>>>>> the download section for monolingual data offers:
>>>>>>
>>>>>> one big file: http://www.statmt.org/wmt11/training-monolingual.tgz
>>>>>>
>>>>>> and separate files, including the news crawls per year.
>>>>>>
>>>>>> However, when you take a single file for a specific year, it is not
>>>>>> the same size as the file of the same name in the big download.
>>>>>>
>>>>>> Expanded sizes for the English corpus:
>>>>>>
>>>>>> news2008: 4.3GB vs 1.6GB for single download
>>>>>> news2009: 5.3GB vs 1.8GB for single download
>>>>>>
>>>>>> etc...
>>>>>>
>>>>>> Can someone please explain the difference?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> Vincent.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
