[Replies inline]
On Wed, 08 Nov 2017 13:25:59 +0100
Francis Tyers <fty...@prompsit.com> wrote:

> My question is: Is this ever the right thing to do ? I
> struggle to come up with use cases for this. I'm not
> sure how hard it would be to fix. But I thought I'd start
> a discussion.

I've been bitten by this, most recently in the previous WMT shared task.
It seems to me that, both in the common MT material from outside and in
my own corpora, by far the most common and most useful "plain" text file
format is newline-separated sentences (with double newlines between
paragraphs and titles), usually as part of a parallel corpus with
either line-by-line matches or some mapping between the lines.
Retaining the line structure is crucial for operating on these files,
including with external tools, automatic evaluation and so forth.

It seems to me that the default operation was meant for real plain text
that is formatted with a perhaps constant line width, like the
Gutenberg corpus or Wikipedia scrapings. That is probably fine for
what it is, but it is not what we usually work on.

I guess what is basically needed is file formats and {de,re}formatters
for these two plain text formats. I also think that the command line
should default to the hard-line-break interpretation; the second form
is more like document translation, which is then more commonly done
with graphical or web tools.
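
To make the two interpretations concrete, here is a minimal Python
sketch. The function names are hypothetical and purely illustrative;
this is not how Apertium's existing deformatters are implemented. The
first reader treats every newline as a hard break and keeps blank lines,
so line numbers stay aligned with a parallel file; the second treats
single newlines as soft wraps and only blank lines as paragraph breaks.

```python
def read_line_per_sentence(text):
    """Hard-line-break reading: one segment per input line.

    Blank lines are kept as empty segments, so segment N still
    corresponds to line N of the input (and of a parallel file).
    """
    return text.split("\n")


def read_wrapped_paragraphs(text):
    """Soft-wrap reading: join wrapped lines, split on blank lines."""
    paragraphs = []
    current = []
    for line in text.split("\n"):
        if line.strip():
            # Non-blank line: part of the current paragraph.
            current.append(line.strip())
        elif current:
            # Blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = "A title\n\nFirst sentence.\nSecond sentence."
print(read_line_per_sentence(sample))
print(read_wrapped_paragraphs(sample))
```

On the sample, the first reading yields four segments (one per input
line, including the empty one), while the second merges the two
sentences into a single paragraph, which is exactly what breaks
line-by-line alignment in a parallel corpus.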

As a compromise, though, I would not mind a new lightweight file format
either (although perhaps not that SGML), because in reality we do often
need to sneak in metadata for most of the texts; it would be good to
have a proper way of doing this instead of relying on filenames and
magic characters.

--
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.



_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
