[Replies inline] On Wed, 08 Nov 2017 13:25:59 +0100 Francis Tyers <fty...@prompsit.com> wrote:
> My question is: Is this ever the right thing to do? I
> struggle to come up with use cases for this. I'm not
> sure how hard it would be to fix. But I thought I'd start
> a discussion.

I've been bitten by this, most recently in the previous WMT shared task.

It seems to me that, both in the common MT material from outside and in my own corpora, by far the most common and most useful "plain" text file format is newline-separated sentences (with double newlines between paragraphs and titles), usually as a parallel corpus with either line-by-line matches or some mapping between the lines. Retaining the line structure is crucial for operating on these files, including with external tools, automatic evaluation and so forth.

It seems to me that the default operation was meant for real plain text that is formatted with a perhaps constant line width, like the Gutenberg corpus or Wikipedia scrapings. That is probably good for what it is, but it is not what we usually work on.

I guess what is basically needed is file formats and {de,re}formatters for these two plain text formats. I also think that the command line should default to the hard-line-break interpretation; the second form is more like document translation, which is more commonly done with graphical or web tools anyway.

As a compromise, though, I would not mind a new light file format either, although perhaps not SGML. In reality we often do need to sneak in metadata for most of the texts, so it would be good to have a proper way of doing this instead of relying on filenames and magic characters.

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>,
Universität Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>.
CLARIN-D Entwickler.
President of ACL SIGUR SIG for Uralic languages <http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.
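
P.S. To make the two plain text interpretations concrete, here is a minimal Python sketch. It is an illustration only, not Apertium code: the function names are made up for this example, and it assumes that a blank line separates paragraphs and titles in both formats.

```python
def split_paragraphs(text):
    """Split text on blank lines; return a list of paragraphs,
    each paragraph being a list of its non-empty, stripped lines."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def sentence_per_line(text):
    """Hard-line-break interpretation (hypothetical name):
    every line is its own unit, so line structure is preserved."""
    return split_paragraphs(text)


def hard_wrapped(text):
    """Wrapped-text interpretation (hypothetical name):
    line breaks inside a paragraph are just whitespace."""
    return [" ".join(lines) for lines in split_paragraphs(text)]


sample = "A title\n\nFirst sentence.\nSecond sentence.\n"
print(sentence_per_line(sample))
print(hard_wrapped(sample))
```

With the first reading, the output can still be aligned line by line against the other side of a parallel corpus; with the second, the two sentences are merged and that alignment is lost.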