Forgot to cc the list, sorry
---------- Forwarded message ----------
From: Jacob Nordfalk <[email protected]>
Date: 2012/3/22
Subject: Re: GSoC - Make lttoolbox-java embeddable
To: Mikel Artetxe <[email protected]>
2012/3/22 Mikel Artetxe <[email protected]>
>
>> Hi,
>>
>> Thank you for your answer. I have completed the coding challenge (at
>> least the first task that you proposed to me), and I am sending you both the
>> demo JAR file and the zipped source code attached to this email.
>>
>
>
$ echo "Artikulu hau hobe dezakezu... badakizu nola?" | java -jar apertium-eu-en.jar
This article you can improve... If he is how?
Great work!
(although the Basque translator could use some polish :-)
>> Regarding this, I agree with you that we should work on settling more
>> specific goals together. I have been thinking about it and I have
>> identified some general points that I think that we should consider:
>>
>
>> *- Provide a friendly way for external programs that use the JAR as a
>> library to interact with it*. The current code is designed with
>> command-line end users in mind, not third-party
>> applications that would use Apertium as an external library for
>> translation. I think that translating something for an external program
>> should be as easy as making a single method call that would take the text
>> to be translated as an argument and return the translated text. This way,
>> we would be offering a powerful translation library that any programmer
>> could use in a really simple way, even if he/she doesn't know anything
>> about machine translation or Apertium.
>>
>
Sure, make an API class. There are some APIs around that it
should probably try to resemble (C++, webservice, ...). I'm not against it.
An API that allows access to the intermediate translation stages, which
Apertium-viewer needs, would also be interesting.
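A minimal sketch of what such an API class might look like, in Java. All
names here (TranslationPipeline, addStage, translate, translateWithTrace) are
hypothetical and are not existing lttoolbox-java classes; the stages are plain
string functions standing in for lt-proc, the tagger, transfer and so on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical API facade; not part of lttoolbox-java. */
final class TranslationPipeline {
    private final List<String> names = new ArrayList<>();
    private final List<UnaryOperator<String>> stages = new ArrayList<>();

    /** Registers one stage; a real stage would wrap lt-proc, the tagger, etc. */
    TranslationPipeline addStage(String name, UnaryOperator<String> stage) {
        names.add(name);
        stages.add(stage);
        return this;
    }

    /** The single call most client programs would need. */
    String translate(String text) {
        String current = text;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    /** Output of every stage in order, as a viewer-like tool would need. */
    Map<String, String> translateWithTrace(String text) {
        Map<String, String> trace = new LinkedHashMap<>();
        String current = text;
        for (int i = 0; i < stages.size(); i++) {
            current = stages.get(i).apply(current);
            trace.put(names.get(i), current);
        }
        return trace;
    }
}
```

A client would build the pipeline once and then call translate(text) for each
input, while a tool like Apertium-viewer could use translateWithTrace(text) to
show what each stage produced.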
Here's another aspect, which I want to share with you about the loading of
resources - as it is inevitably related to how resources are packaged:
When translating short sentences the Java port has the disadvantage that it
takes about 1/3 of a second to load.
This is mainly due to the fact that the binary dictionaries (.bin files)
have to be loaded.
The method is here:
void read(InputStream input, Alphabet alphabet)
http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox-java/src/org/apertium/lttoolbox/process/TransExe.java?view=log
(BTW, please see also the C++ version; as you can see, they are really similar:
http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox/lttoolbox/trans_exe.cc?revision=6607&view=markup
)
As you can see in the code, this is mainly doing a lot of calls to
multibyte_read (which essentially is a UTF-8 decoder) and converting the
input to a lot of Node objects, each having an array of pointers to other
Node objects. During translation there is a state machine jumping around in
these nodes.
So.... basically I am toying with the idea of NOT creating Node objects
etc, but just representing it all in one big array of chars or ints. An
array which could be read directly from disk.
This will bring loading CPU time and startup time down considerably, for
both the C++ and the Java versions.
And... we have had some brief discussions about memmapping techniques,
which could, in principle, allow us to just load the parts of the array we
need into memory, thus further reducing startup time.
So, just for consideration, there is an option that compiled dictionaries
could lie on the disk as 'raw memory chunks' which would be loaded whenever
needed (memmap).
(That is, if we agree that this is an interesting path to explore.)
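To make the idea concrete, here is a minimal Java sketch of a transducer kept
in one flat int array and memory-mapped from disk. The layout is an assumption
made up for this example and is NOT the existing lttoolbox .bin format; the
class and method names (FlatTransducer, map, write, step, accepts) are
hypothetical too.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical flat layout, not the lttoolbox .bin format. Each node is
 *   [isFinal, transitionCount, (symbol, targetOffset) * transitionCount]
 * and a node is identified by the offset of its first int.
 */
final class FlatTransducer {
    private final IntBuffer data;

    FlatTransducer(IntBuffer data) {
        this.data = data;
    }

    /** Maps the file into memory; the OS loads pages lazily on access. */
    static FlatTransducer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return new FlatTransducer(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer());
        }
    }

    /** Writes the int array as big-endian bytes (the default buffer order). */
    static void write(Path file, int[] nodes) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(nodes.length * Integer.BYTES);
        bytes.asIntBuffer().put(nodes);
        Files.write(file, bytes.array());
    }

    /** Follows one transition from the node at offset; returns -1 if none. */
    int step(int offset, int symbol) {
        int count = data.get(offset + 1);
        for (int i = 0; i < count; i++) {
            int entry = offset + 2 + 2 * i;
            if (data.get(entry) == symbol) {
                return data.get(entry + 1);
            }
        }
        return -1;
    }

    /** True if the whole input leads from node 0 to a final node. */
    boolean accepts(String input) {
        int offset = 0;
        for (int i = 0; i < input.length() && offset >= 0; i++) {
            offset = step(offset, input.charAt(i));
        }
        return offset >= 0 && data.get(offset) == 1;
    }
}
```

With this layout no Node objects are created at load time: following a
transition is just index arithmetic on the mapped buffer, and only the pages
that are actually visited need to be read from disk.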
>
>> *- Offer an easy mechanism to create JAR files for any language pair.* In my
>> solution for the coding challenge the language pair to be translated
>> is hardcoded, and the resources have been manually copied to their
>> corresponding location. However, considering the amount of language pairs
>> that Apertium supports such solution is not maintainable at all, and some
>> easier system should be offered. For instance, I have thought that a
>> library JAR file by itself could be executable at the same time and, when
>> running it, it would display a simple interface to manage the language
>> pairs that it has embedded (remove them or add new ones). This way, we
>> would only have to care about a single application, and not dozens of them.
>> Already prepared JARs could be offered as well to make it simpler for
>> programmers that don't know about the internals of Apertium.
>>
>
I've taken a look at the JAR file you sent and deleted classes that are not
used during runtime.
It turns out that your cleaned eu-en pair JAR file is 1.8 MB and if I
delete lttoolbox-java it is 1.6 MB.
I'd say that it would also be OK to just have that extra 0.2 MB
that lttoolbox-java consists of in each pair.
>
>> *- Develop a good mechanism to access embedded resources*. As said
>> before, we should decide whether to use temporary files or partially
>> rewrite the current code to directly use the embedded resources.
>>
>> *- Additionally, I could work on some application that would use the
>> embedded JAR*. This way, final users would have an easy to use program
>> that requires no installation at all. At the same time, it could be a good
>> example for programmers to see how they could use Apertium as a library for
>> their own projects.
>
>
That's a good idea.
>> I could also write some tutorial regarding that. Depending on the
>> available time, the type of application could vary: making a simple Swing
>> application would be very easy,
>
>
Easy to make, and also to understand and modify (if you keep it really simple).
That's why I don't think an API is that important.
>> but an Android app would require more time (although I have experience in
>> mobile development for iOS I haven't developed anything for Android, but it
>> shouldn't be too difficult for me to learn it, I guess).
>>
>
Fine, as long as the scope is embeddability, and not making sophisticated
GUIs.
Anything beyond the simplest app would be for
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Apertium_on_your_mobile
. On the other hand, trying to deploy on Android is relevant as a test of
embeddability and I think you could try together on Android.
You'd immediately have the problem that you have to invoke a special Android
method (getCacheDir()) to know where you can put temporary files
(System.getProperty("java.io.tmpdir") won't work).
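One way to handle this without hardcoding either call is to let the host
application tell the library where temporary files may go. The sketch below is
only an illustration under that assumption; the class TempDirs and its methods
are hypothetical, not existing lttoolbox-java code.

```java
import java.io.File;
import java.io.IOException;
import java.util.UUID;

/** Hypothetical helper: lets the host application choose the temp root. */
final class TempDirs {
    // null means "fall back to the java.io.tmpdir system property"
    private static volatile File root;

    /** An Android app would call this once with context.getCacheDir(). */
    static void setRoot(File dir) {
        root = dir;
    }

    /** Creates and returns a fresh, uniquely named directory under the root. */
    static File create() throws IOException {
        File base = root;
        if (base == null) {
            base = new File(System.getProperty("java.io.tmpdir"));
        }
        File dir = new File(base, UUID.randomUUID().toString());
        if (!dir.mkdirs()) {
            throw new IOException("Could not create " + dir);
        }
        return dir;
    }
}
```

A desktop program would simply call TempDirs.create(), while an Android
activity would first call TempDirs.setRoot(getCacheDir()), so the library
itself never depends on Android classes.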
Now, we don't know how many GSoC slots we have, or whether both (or neither)
of the projects will get a GSoC slot. Let's see what happens and then decide.
>
>>
>> Some important points are probably missing, and we can discuss further
>> about the ones that I have mentioned. I am very open in that sense. As
>> for the coding challenge, I can work on improving my solution (for
>> instance, developing a simple GUI as you suggested in your email), but I
>> think that your feedback would be very interesting for me before going for
>> that...
>>
>>
I haven't really got much more feedback to give. Nice work!
(one suggestion below)
If I didn't miss anything, you changed eu-en.modes to
/usr/local/bin/lt-proc PATH_TO_TEMP_DIR/eu-en.automorf.bin
|/usr/local/bin/apertium-tagger -g $2 PATH_TO_TEMP_DIR/eu-en.prob
|/usr/local/bin/apertium-pretransfer
|/usr/local/bin/apertium-transfer -n PATH_TO_TEMP_DIR/apertium-eu-en.colloc.t1x PATH_TO_TEMP_DIR/eu-en.colloc.t1x.bin
|/usr/local/bin/apertium-transfer -n PATH_TO_TEMP_DIR/apertium-eu-en.ordinals.t1x PATH_TO_TEMP_DIR/eu-en.ordinals.t1x.bin
|/usr/local/bin/apertium-transfer PATH_TO_TEMP_DIR/apertium-eu-en.eu-en.t1x PATH_TO_TEMP_DIR/eu-en.t1x.bin PATH_TO_TEMP_DIR/eu-en.autobil.bin
|/usr/local/bin/apertium-interchunk PATH_TO_TEMP_DIR/apertium-eu-en.eu-en.t2x PATH_TO_TEMP_DIR/eu-en.t2x.bin
|/usr/local/bin/apertium-postchunk PATH_TO_TEMP_DIR/apertium-eu-en.eu-en.t3x PATH_TO_TEMP_DIR/eu-en.t3x.bin
|/usr/local/bin/lt-proc $1 PATH_TO_TEMP_DIR/eu-en.autogen.bin
|/usr/local/bin/lt-proc -p PATH_TO_TEMP_DIR/eu-en.autopgen.bin
and wrote the class
package org.apertium;

import org.apertium.pipeline.ApertiumMain;
import static org.apertium.utils.IOUtils.readFile;
import static org.apertium.utils.IOUtils.writeFile;
import java.io.*;
import java.util.Enumeration;
import java.util.UUID;
import java.util.zip.*;

/**
 *
 * @author Mikel Artetxe
 */
public class CommandLineInterface {

    public static final String PACKAGE_VERSION = "3.2j";

    /**
     * Extracts the embedded resources inside the JAR file at the directory
     * specified by path to a temporary folder
     * @param path The path of the directory inside the JAR file in which
     *             the embedded resources are located
     * @return The temporary folder that has been created
     * @throws IOException Exception thrown if something goes wrong while
     *                     writing files
     */
    public static File extractResources(String path) throws IOException {
        // We get the ZIP file that corresponds to the running JAR file
        // (keep in mind that JAR files are renamed ZIP files after all).
        // Note: getFile() returns a URL-encoded path, so this assumes the
        // JAR location contains no spaces or other escaped characters
        ZipFile jar = new ZipFile(CommandLineInterface.class
                .getProtectionDomain().getCodeSource().getLocation().getFile());
        try {
            // Since the creation of temporary directories is not natively
            // supported by Java until its 7th version, we get the default
            // temporary folder provided by the operating system and create
            // our own temporary directory with a randomly chosen name there
            File sysTempDir = new File(System.getProperty("java.io.tmpdir"));
            File ourTempDir = new File(sysTempDir, UUID.randomUUID().toString());

            // We simply extract the files at the indicated path to our newly
            // created temporary directory. The code is based on the one at
            // http://www.java-examples.com/extract-zip-file-subdirectories-using-command-line-argument-example
            Enumeration<? extends ZipEntry> e = jar.entries();
            while (e.hasMoreElements()) {
                // Each iteration we will take an entry and extract it if
                // necessary
                ZipEntry entry = e.nextElement();
                // If the file is not inside the given directory, we go to
                // the next entry
                if (!entry.getName().startsWith(path))
                    continue;
                // We will extract the files to our temporary directory.
                // Note that the part of the given path inside the JAR must
                // be removed (this will correspond to the root of the
                // temporary directory)
                File destinationFilePath = new File(ourTempDir,
                        entry.getName().substring(path.length()));
                // We create directories if it is needed
                destinationFilePath.getParentFile().mkdirs();
                // We extract the file if it is not a directory
                if (!entry.isDirectory()) {
                    BufferedInputStream bis =
                            new BufferedInputStream(jar.getInputStream(entry));
                    BufferedOutputStream bos = new BufferedOutputStream(
                            new FileOutputStream(destinationFilePath), 1024);
                    try {
                        int b;
                        byte[] buffer = new byte[1024];
                        while ((b = bis.read(buffer, 0, 1024)) != -1)
                            bos.write(buffer, 0, b);
                        bos.flush();
                    } finally {
                        // Close the streams even if copying fails
                        bos.close();
                        bis.close();
                    }
                }
            }
            return ourTempDir;
        } finally {
            // Closing the ZipFile also closes any entry streams still open
            jar.close();
        }
    }
    /**
     * Deletes the specified file and all its contents if it is a directory
     * @param file The file (or directory) to be deleted
     * @return true if everything goes well, false otherwise
     */
    public static boolean delete(File file) {
        if (file.isDirectory()) {
            // listFiles() returns null if the directory cannot be read
            File[] children = file.listFiles();
            if (children == null)
                return false;
            for (File child : children)
                if (!delete(child))
                    return false;
        }
        return file.delete();
    }
    public static void main(String[] argv) throws Exception {
        // We extract the embedded resources of the Basque-English language
        // pair to a temporary directory
        File resourcesDir = extractResources("apertium-eu-en/");
        // We edit the mode file so that the resource paths point to the
        // ones in our temporary directory. String.replace() is used rather
        // than replaceAll() so that backslashes or '$' in the path
        // (e.g. on Windows) are taken literally
        String modeFile = resourcesDir.toString() + "/modes/eu-en.mode";
        String newMode = readFile(modeFile).replace("PATH_TO_TEMP_DIR",
                resourcesDir.toString());
        writeFile(modeFile, newMode);
        // We call the main function of ApertiumMain with the right
        // parameters, using the resources that we have just extracted to a
        // temporary directory
        String[] euEnArgs = {"-d", resourcesDir.getPath(), "eu-en"};
        ApertiumMain.main(euEnArgs);
        // We delete the temporary directory and all its content
        if (!delete(resourcesDir))
            System.err.println(
                    "Something failed while trying to remove the temporary files...");
    }
}
You could go a step deeper into what happens in ApertiumMain.main(euEnArgs)
and see if you could get the resources loaded directly from the JAR file.
Try to avoid modifying lttoolbox-java if you can. In that case you can just
show your own files.
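As a starting point, here is a minimal Java sketch of reading a packaged file
straight from the classpath (and therefore from inside the JAR) instead of
extracting it first. The class JarResources and the resource path in the
comment are assumptions for illustration; only Class.getResourceAsStream() is
standard Java. The resulting InputStream is the kind of argument that
read(InputStream input, Alphabet alphabet) in TransExe already takes.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for reading packaged files without extracting them. */
final class JarResources {
    /** Opens a classpath resource, e.g. "/apertium-eu-en/eu-en.automorf.bin". */
    static InputStream open(Class<?> anchor, String path) throws IOException {
        InputStream in = anchor.getResourceAsStream(path);
        if (in == null) {
            // getResourceAsStream() signals a missing resource with null
            throw new FileNotFoundException("Not on classpath: " + path);
        }
        return in;
    }

    /** Reads a stream fully into memory and closes it. */
    static byte[] readAll(InputStream in) throws IOException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close();
        }
    }
}
```

Whether this is enough depends on how many places in the pipeline open files
by name rather than by stream, which is exactly what is worth checking inside
ApertiumMain.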
Jacob
--
Jacob Nordfalk <https://plus.google.com/114820443085046080944>
http://javabog.dk
Android developer and instructor at
IHK <http://cv.ihk.dk/diplomuddannelser/itd/vf/MAU> and
Lund&Bendsen <https://www.lundogbendsen.dk/undervisning/beskrivelse/LB1809/>
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff