Re: [Apertium-stuff] GSoC - Make lttoolbox-java embeddable

Mikel Artetxe Sun, 25 Mar 2012 07:09:49 -0700

>
>>> Here's another aspect, which I want to share with you about the loading
>>> of resources, as it is inevitably related to how resources are packaged:
>>>
>>> When translating short sentences the Java port has the disadvantage that
>>> it takes about 1/3 of a second to load.
>>>
>>> This is mainly due to the fact that the binary dictionaries (.bin files)
>>> have to be loaded.
>>>
>>> The method is here:
>>>
>>> void read(InputStream input, Alphabet alphabet)
>>>
>>>
>>> http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox-java/src/org/apertium/lttoolbox/process/TransExe.java?view=log
>>>
>>> (BTW, please also see the C++ version; as you can see, they are really similar:
>>>
>>> http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox/lttoolbox/trans_exe.cc?revision=6607&view=markup
>>>  )
>>>
>>> As you can see in the code this is mainly doing a lot of calls to
>>> multibyte_read
>>> (which essentially is a UTF-8 decoder) and converting input to a lot of
>>> Node objects each having an array of pointers to other Node objects. During
>>> translation there is a state machine jumping around in these nodes.
>>>
>>> So.... basically I am toying with the idea of NOT creating Node objects
>>> etc, but just representing it all in one big array of chars or ints. An
>>> array which could be read directly from disk.
>>>
>>> This would bring loading CPU time and startup time down considerably,
>>> for both the C++ and the Java versions.
>>>
>>> And... we have had some brief discussions about memmapping techniques,
>>> which could, in principle, allow us to just load the parts of the array we
>>> need into memory, thus further reducing startup time.
>>>
>>> So, just for consideration, there is an option that compiled
>>> dictionaries could lie on the disk as 'raw memory chunks' which would be
>>> loaded whenever needed (memmap).
>>> (That is, if we agree that this is an interesting path to explore.)
>>>
>>>
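
To make the "one big array" idea quoted above a bit more concrete, here is a minimal sketch. The layout is purely hypothetical (it is not the real lttoolbox .bin format), and it is simplified to a deterministic acceptor with no output symbols: each state lives at an offset in an int array, and a transition is just a jump to another offset.

```java
/** Hypothetical flat layout for illustration; not the real lttoolbox .bin format. */
final class FlatTransducer {
    // Layout per state, starting at the state's offset in the array:
    //   [0]   1 if the state is final, else 0
    //   [1]   number of outgoing transitions n
    //   [2..] n pairs of (input symbol, offset of target state)
    private final int[] data;

    FlatTransducer(int[] data) {
        this.data = data;
    }

    /** Returns the offset of the target state, or -1 if there is no transition. */
    int step(int state, int symbol) {
        int n = data[state + 1];
        for (int i = 0; i < n; i++) {
            int base = state + 2 + 2 * i;
            if (data[base] == symbol) {
                return data[base + 1];
            }
        }
        return -1;
    }

    boolean isFinal(int state) {
        return data[state] == 1;
    }

    /** Runs the automaton from offset 0 and reports whether the word is accepted. */
    boolean accepts(String word) {
        int state = 0;
        for (int i = 0; i < word.length(); i++) {
            state = step(state, word.charAt(i));
            if (state < 0) {
                return false;
            }
        }
        return isFinal(state);
    }

    /** Tiny example automaton that accepts exactly "ab" and "ac". */
    static FlatTransducer example() {
        return new FlatTransducer(new int[] {
            0, 1, 'a', 4,            // offset 0: not final, a -> 4
            0, 2, 'b', 10, 'c', 10,  // offset 4: not final, b -> 10, c -> 10
            1, 0                     // offset 10: final, no transitions
        });
    }
}
```

With a layout like this no Node objects are allocated at load time; the array is the automaton.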
>> I have been looking at the code in some depth and I think that I
>> have got a general understanding of it. Your solution makes sense to me,
>> and I am sure that it is an interesting path to explore. But, to be honest,
>> right now I don't see a clear way to accomplish the task (in other words,
>> I understand the "what" but not the "how"). In any case, I am always
>> willing to learn. So, could you specify it a bit more to see whether things
>> get clearer for me, please?
>>
>
> The stuff I threw at you - as a thought, because I got warmed up by the
> discussion - is probably a big mouthful.
> I think we should leave 'reducing start-up time' for now, as it's not
> necessarily a task that has anything to do with embedding. Sorry.
>


Well, maybe I can work on it, the only thing that I meant was that right
now I wouldn't really know how to accomplish it. And I don't want to accept
something unless I am confident that I can do it...
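
As a rough idea of what the memory-mapping part might look like in Java, here is a sketch that only assumes a file of raw 32-bit ints. The class and method names are made up for illustration and nothing like this exists in lttoolbox-java today. It uses FileChannel.map, which typically lets the OS page the data in on first access instead of reading the whole file up front.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; assumes the dictionary is stored as raw big-endian ints. */
final class MappedDictionary {
    /** Writes the ints as big-endian 32-bit values (the ByteBuffer default order). */
    static void write(Path file, int[] data) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(data.length * Integer.BYTES);
        bytes.asIntBuffer().put(data);
        Files.write(file, bytes.array());
    }

    /** Maps the whole file read-only and exposes it as an IntBuffer. */
    static IntBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                          .asIntBuffer();
        }
    }
}
```

The state machine would then index into the IntBuffer with get(int) instead of following Node references.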



>>>> but an Android app would require more time (although I have experience in
>>>> mobile development for iOS, I haven't developed anything for Android, but it
>>>> shouldn't be too difficult for me to learn, I guess).
>>>>
>>>
>> Fine, as long as the scope is embeddability, and not making sophisticated
>> GUIs.
>>
>> Anything beyond the simplest app would be for
>> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Apertium_on_your_mobile
>> . On the other hand, trying to deploy on Android is relevant as a test of
>> embeddability, and I think you could give it a try on Android.
>> You'd immediately run into the problem that you have to call a special
>> Android method (getCacheDir()) to find out where you can put temporary
>> files (System.getProperty("java.io.tmpdir") won't work).
>>
>> Now, we don't know how many GSoC slots we have, or whether both (or
>> neither) of the projects will get a GSoC slot. Let's see what happens and
>> then decide.
>>
>>
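
One way the temporary-directory issue mentioned above could be handled without tying the library to Android is to let the host application pass the directory in. This is only a sketch: TempDirs and resolve are hypothetical names, and on Android the caller would pass the result of Context.getCacheDir().

```java
import java.io.File;

/** Hypothetical helper; not part of lttoolbox-java. */
final class TempDirs {
    /**
     * Picks the directory to use for temporary files.
     *
     * @param override directory supplied by the host application, for example
     *                 the result of Context.getCacheDir() on Android; may be null
     */
    static File resolve(File override) {
        if (override != null) {
            return override;
        }
        // Fallback for desktop JVMs, where this property is normally set.
        String tmp = System.getProperty("java.io.tmpdir");
        if (tmp == null) {
            throw new IllegalStateException("no temporary directory available");
        }
        return new File(tmp);
    }
}
```

Of course, if the resources can be read directly from the JAR, temporary files may not be needed at all.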
> Well, a GSoC application has to include a work plan, detailing what I
> would be working on each week. This is why I mentioned it; it is just an
> idea. In fact, I don't have any previous experience with Android development,
> as I told you, so you probably have better-prepared candidates for this
> task. The thing is that it is impossible for me to elaborate a work plan
> if we haven't really established what to do (and I only have two weeks for
> it...). My points were meant to be just some general ideas regarding this...
>

> Try to make a draft of the work plan and let's base discussions on that.
>


OK. This would be my very first draft of the work plan:

Weeks 1-3: Adapt lttoolbox-java so that it can directly work with embedded
files without the need of copying them to a temporary directory.
Week 4: Make an API class that easily allows the translation of an embedded
language pair. Work on a demo JAR executable usable from the command line
that would make use of it with a specific language pair.
Deliverable #1: The above mentioned JAR executable.
Week 5: Work on an API class that allows access to the intermediary
translation stages.
Week 6: Make a small user-oriented Swing application for a specific
language pair (something similar to apertium-tolk). The idea is that any
user could download it and run it by just double-clicking it, the only
prerequisite being an installed JVM.
Weeks 7-8: Adapt and extend the previous application so that it can work
with several language pairs. This could be achieved by having a JAR per
language pair and the main JAR executable that makes use of them or by
integrating everything on a single JAR executable that is able to modify
itself in order to add or remove language pairs from it (I am not sure
about which approach would be better).
Deliverable #2: The Swing application developed in weeks 6-8.
Weeks 9-11: Integrate everything with apertium-viewer so that it can work
with the JAR files of the above mentioned Swing application.
Week 12: Suggested "pencils down" date: write documentation, test
everything...
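
Just to illustrate what I have in mind for the API class of weeks 4-5, here is a rough sketch. All names are hypothetical, and the real stages would be the lttoolbox-java programs rather than plain functions: translate returns the final result, while trace also exposes the output of every intermediary stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical API shape; each stage stands in for one program of a mode. */
final class StagedPipeline {
    private final List<UnaryOperator<String>> stages;

    StagedPipeline(List<UnaryOperator<String>> stages) {
        this.stages = new ArrayList<>(stages);
    }

    /** Runs all stages and returns only the final output. */
    String translate(String input) {
        List<String> outputs = trace(input);
        return outputs.get(outputs.size() - 1);
    }

    /** Element 0 is the input; element i is the output of stage i. */
    List<String> trace(String input) {
        List<String> outputs = new ArrayList<>();
        outputs.add(input);
        String current = input;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
            outputs.add(current);
        }
        return outputs;
    }
}
```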

To tell the truth, I am not really sure about it (probably too few things
to do and not very well specified...). What's your opinion? Any suggestion?


>>> You could go a step deeper into what happens in ApertiumMain.main(euEnArgs)
>>> and see if you could get the resources loaded directly from the JAR file.
>>>
>>
>> Do you mean without extracting them to a temporary directory? I am afraid
>> that this is not possible by just modifying the classes in the
>> org.apertium.pipeline package. Embedded resources cannot be loaded as
>> Files, so deeper changes would be needed. That is, each program
>> (ApertiumInterchunk, LTProc and so on) that is called through the
>> dispatcher takes the paths of the resources that it needs in its arguments,
>> and loads them by itself. So it would be necessary to adapt all of
>> them to access the embedded resources, and not only ApertiumMain (in fact,
>> ApertiumMain only deals with the .mode file, but it doesn't directly do
>> anything with the rest).
>>
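
A minimal sketch of the kind of change this implies: since the read method quoted earlier already takes an InputStream, each program could obtain its streams through one helper that understands both file paths and embedded resources. ResourceOpener and the "classpath:" prefix are assumptions made up for this example, not existing lttoolbox-java conventions.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; the "classpath:" prefix is an invented convention. */
final class ResourceOpener {
    static final String EMBEDDED_PREFIX = "classpath:";

    /** Opens either an embedded resource or a regular file, depending on the prefix. */
    static InputStream open(String location) throws IOException {
        if (location.startsWith(EMBEDDED_PREFIX)) {
            String name = location.substring(EMBEDDED_PREFIX.length());
            // Looks the resource up on the classpath, i.e. inside the JAR when packaged.
            InputStream in = ResourceOpener.class.getClassLoader().getResourceAsStream(name);
            if (in == null) {
                throw new FileNotFoundException("not on classpath: " + name);
            }
            return in;
        }
        return new FileInputStream(location);
    }
}
```

Existing command lines that pass plain paths would keep working, while an embedded language pair would pass prefixed names instead.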
>
> Agreed. It's a bigger task that requires modification of the lttoolbox-java
> code.
>
>

Yes. This is why I have added it to the work plan. In any case, is there
anything that I should work on regarding the coding challenge or do you
consider it finished?


Mikel
