On Sat, Jan 23, 2010 at 9:53 PM, christopher taylor <christopher.paul.tay...@gmail.com> wrote:
> there is an organization that is offering a service in which you can
> translate SMS/text messages - the data is public domain - it is a
> crowd source effort and it makes your ability to assist in haiti
> something that you can do from home.
> http://4636.ushahidi.com/search_post.php
> the data is reviewed by responders and b/c geospatial data is attached
> to the message, you can help gov'ts, NGO's, and aid groups provide
> appropriate responses to areas in need of specific types of aid.
Yes, I was made aware of this SMS project yesterday or the day before, and took a look at the data. I'll give my feedback on this from a different angle.

Imagine you were offered the content of thousands of unedited IM chat messages that had been translated rapidly, in urgent mode, with no opportunity to ask questions or get context, and with the sole aim of producing a rough gist of each message. This is exactly what human translators complain about in their discussion forums. Go read the discussion forums at Proz and Translatorscafe.com. Lack of context and understanding = inability to translate the content, even for a living human being with social knowledge and the ability to make 2 + 2 = 5. Would you trust this IM chat content as the training corpus for your baseline MT engine?

On the other hand, I just got off the phone with a Haitian Creole content provider and publisher who clearly agreed with what I wrote in a previous post about the state of Haitian Creole texts found on the web, and about the massive amount of clean-up and editing work that is necessary to produce publishable content. If a Haitian-born expert in the field of Haitian Creole publications confirms that unedited text found on the web is questionable as-is, would you still use it?

It might be better to use the SMS content later, as a fine-tuning mechanism for creating spell-checkers, spelling normalizers, variable expression indicators and a number of derivative scripts and applications. But not as a baseline training corpus, especially when there are only 13,000 other translated sentences to start with (I am still trying to see how much more can be made available).

And please do not try to create an MT system based on all that unknown content and then give it as-is to human Haitian Creole translators so that they can start helping you improve the engine. This will simply reinforce the already very negative attitude of human translators toward MT and maintain MT's bad reputation.
The users here are not grad student guinea pigs for research projects. They are people working on overstressed disaster relief projects in which the number of translation requests is sky-rocketing beyond the bandwidth of the human translators. I just saw a call for human Creole translators because there are 200 messages for emergency services sitting in the queue with no one to translate them.

It's like cyclists on the Tour de France climbing through the mountains. The sponsor promises a 27-speed bicycle with gear ratios that could climb a wall, then delivers a 3-speed bicycle and says "Get ready, get set, go" without mentioning that it is a basic 3-speed.

Or, closer to our situation, take the analogy of the tsunami in 2004. Would any of you have thrown together a Stat-MT system in a couple of days based upon the language resource databases for Indonesian that were available at that point in time? Kirti and Dion, you guys are the experts on Asian-language area MT needs. Given your expertise on this in real-world translation projects, what would you have done at that point in time with the content that was available back then?

Sure, the Stat-MT methods have improved a bit, but the question here is the content, the type of content, the quality of the content, and whether it would be appropriate to use a system built on a poor data set in such critical relief contexts. I know what I think about it, based on experience. How about you guys and others?

Let's provide a Haitian Creole system to real users based on a good training data set. This is about meeting critical communication needs. Within a few extra days, maybe a week or two, there could be a high enough volume of clean, quality content to create a very good SMT-based Creole engine.

And now I'm explaining to this Creole content provider the types of licensing options that correspond to the massive amount of content they could make available.
It sure helps to have previously worked at the European Language Resources Association / Distribution Agency (ELRA/ELDA) and to have given a couple of dozen presentations and talks at conferences on language data distribution issues and licensing schemes (talks/papers downloadable at my LinkedIn profile). That background helps in providing the necessary advice on the kinds of content that are being considered for distribution.

Jeff

_______________________________________________
Mt-list mailing list