Re: [Apertium-stuff] GSoC - Make lttoolbox-java embeddable

Jacob Nordfalk Sat, 24 Mar 2012 07:53:30 -0700

2012/3/23 Mikel Artetxe <[email protected]>

>
>>  Sure, make an API class. There are some APIs around that it
>>  should probably try to resemble (C++, webservice, ...). I'm not against it.
>>
>> An API that allows access to the intermediary translation stages, like
>> Apertium-viewer would need, would be interesting for Apertium-viewer.
>>
>
> I think that this can be an interesting point to work on (develop an API
> that allows access to the intermediary stages and integrate it with
> Apertium-viewer).
>


Great.
Put that on the todo-list then :-)
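
To make the idea a bit more concrete, here is a rough sketch of what such
an API could look like. It assumes every stage of the pipeline can be
treated as a String -> String function. The names StagedPipeline,
addStage and translateWithTrace are made up for illustration and do not
exist in lttoolbox-java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: a pipeline that keeps the output of every stage. */
public class StagedPipeline {
    // LinkedHashMap preserves the order in which stages were added.
    private final Map<String, UnaryOperator<String>> stages = new LinkedHashMap<>();

    /** Registers a named stage; stages run in registration order. */
    public StagedPipeline addStage(String name, UnaryOperator<String> stage) {
        stages.put(name, stage);
        return this;
    }

    /** Runs all stages and returns stage name -> output of that stage. */
    public Map<String, String> translateWithTrace(String input) {
        Map<String, String> trace = new LinkedHashMap<>();
        String current = input;
        for (Map.Entry<String, UnaryOperator<String>> e : stages.entrySet()) {
            current = e.getValue().apply(current);
            trace.put(e.getKey(), current);
        }
        return trace;
    }

    /** Convenience method: only the output of the last stage. */
    public String translate(String input) {
        String last = input;
        for (String out : translateWithTrace(input).values()) {
            last = out;
        }
        return last;
    }
}
```

A viewer could show each entry of the returned map in its own pane, while
ordinary callers would only use translate().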




>
>> Here's another aspect, which I want to share with you about the loading of
>> resources - as it is inevitably related to how resources are packaged:
>>
>> When translating short sentences the Java port has the disadvantage that
>> it takes about 1/3 of a second to load.
>>
>> This is mainly due to the fact that the binary dictionaries (.bin files)
>> have to be loaded.
>>
>> The method is here:
>>
>> void read(InputStream input, Alphabet alphabet)
>>
>>
>> http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox-java/src/org/apertium/lttoolbox/process/TransExe.java?view=log
>>
>> (BTW Please see also C++ version - as you can see they are really similar
>>
>> http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox/lttoolbox/trans_exe.cc?revision=6607&view=markup
>>  )
>>
>> As you can see in the code this is mainly doing a lot of calls to
>> multibyte_read
>> (which essentially is a UTF-8 decoder) and converting input to a lot of
>> Node objects each having an array of pointers to other Node objects. During
>> translation there is a state machine jumping around in these nodes.
>>
>> So.... basically I am toying with the idea of NOT creating Node objects
>> etc, but just representing it all in one big array of chars or ints. An
>> array which could be read directly from disk.
>>
>> This will bring loading CPU time and startup time down considerably, both
>> for C++ and Java version.
>>
>> And... we have had some brief discussions about memmapping techniques,
>> which could, in principle, allow us to just load the parts of the array we
>> need into memory, thus further reducing startup time.
>>
>> So, just for consideration, there is an option that compiled dictionaries
>> could lie on the disk as 'raw memory chunks' which would be loaded whenever
>> needed (memmap).
>> (That is, if we agree that this is an interesting path to explore.)
>>
>>
> I have been looking at the code in some depth and I think that I have
> got a general understanding of it. Your solution makes sense to me, and I
> am sure that it is an interesting path to explore. But, to be honest, right
> now I don't see a clear way to accomplish the task (in other words, I
> understand the "what" but not the "how"). In any case, I am always willing
> to learn. So, could you specify it a bit more to see whether things become
> clearer for me, please?
>

The stuff I threw at you - as a thought, because I got warmed up by the
discussion - is probably a big mouthful.
I think we should leave 'reducing start-up time' for now, as it's not
necessarily a task that has anything to do with embedding. Sorry.
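
Just so the "one big array" idea is less abstract, here is a minimal
sketch. It is NOT the current .bin format: the layout (per state: a final
flag, a transition count, then symbol/target pairs) and the class name
FlatAutomaton are assumptions made up for illustration, and it only
recognises words instead of producing output like a real transducer.

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical flat layout, no Node objects. A state is the offset of:
 *   [isFinal, n, symbol_1, target_1, ..., symbol_n, target_n]
 * where each target is the offset of another state. State 0 is the start.
 */
public class FlatAutomaton {
    private final IntBuffer table;

    public FlatAutomaton(IntBuffer table) {
        this.table = table;
    }

    /** Memory-maps a file of big-endian ints; the OS pages it in on demand. */
    public static FlatAutomaton map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return new FlatAutomaton(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer());
        }
    }

    /** Follows one transition; returns the target offset, or -1 if none. */
    private int step(int state, int symbol) {
        int n = table.get(state + 1);
        for (int i = 0; i < n; i++) {
            int entry = state + 2 + 2 * i;
            if (table.get(entry) == symbol) {
                return table.get(entry + 1);
            }
        }
        return -1;
    }

    /** True if the word leads from the start state to a final state. */
    public boolean accepts(String word) {
        int state = 0;
        for (int i = 0; i < word.length(); i++) {
            state = step(state, word.charAt(i));
            if (state < 0) {
                return false;
            }
        }
        return table.get(state) == 1;
    }
}
```

Loading is then a single map() call with no decoding and no object
creation, and the same lookup code works on an in-memory array through
IntBuffer.wrap().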



>
> I have also thought that, alternatively, it might be possible to load
> everything as it is done now, but keeping it in memory for subsequent
> translations. That is, right now an execution of the program corresponds to
> a single translation. But, having in mind the final users (a desktop or a
> mobile app, for instance), it is very likely that one translation will be
> followed by another one, so why reload everything for each one? What I mean
> is that, although an extra 1/3 of a second for each translation can be a
> considerable time, waiting an extra 1/3 of a second for launching the
> application probably is not. I don't know if I have expressed my ideas
> correctly (I am sorry about my bad English), and maybe it's just a stupid
> idea....
>

I understand your idea. And that's exactly what would happen in Android,
actually: the OS keeps the process in memory as long as there is space, and
then, if memory is needed for other things, it kills the process. If the
process isn't killed, the stuff is ready in memory and from the program's
perspective it just gets called twice.

So, work on reducing startup time wouldn't be absolutely necessary on
Android. Nor in web server apps.
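
Inside a long-lived process, the "load once, reuse afterwards" idea could
be sketched as below. DictionaryCache is a hypothetical name, and the
loader function stands in for whatever actually reads a .bin file; nothing
here is existing lttoolbox-java code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch: loads each dictionary once and reuses it afterwards. */
public class DictionaryCache<T> {
    private final ConcurrentMap<String, T> loaded = new ConcurrentHashMap<>();
    private final Function<String, T> loader;

    /** The loader does the expensive work, e.g. parsing a .bin file. */
    public DictionaryCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the cached object for this path, loading it on first use. */
    public T get(String path) {
        return loaded.computeIfAbsent(path, loader);
    }
}
```

Only the first translation pays the loading cost; later ones get the
already-built object back, for as long as the process stays alive.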



>
> * - Additionally, I could work on some application that would use the
>>> embedded JAR*. This way, final users would have an easy to use program
>>> that requires no installation at all. At the same time, it could be a good
>>> example for programmers to see how they could use Apertium as a library for
>>> their own projects.
>>
>>
> That's a good idea.
>>
>
> Great!
>

OK, agreed. So, put it in the schedule.



>
>>
>>
>>> I could also write some tutorial regarding that. Depending on the
>>>> available time, the type of application could vary: making a simple Swing
>>>> application would be very easy,
>>>
>>>
>> To make and to understand and modify (if you make it really simple) .
>> That's why I don't think an API is that important.
>>
>>
>>
>>> but an Android app would require more time (although I have experience
>>>> on mobile development for iOS I haven't developed anything for Android, but
>>>> it shouldn't be too difficult for me to learn it, I guess).
>>>>
>>>
>> Fine, as long as the scope is embeddability, and not making sophisticated
>> GUIs.
>>
>> Anything beyond the simplest app would be for
>> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Apertium_on_your_mobile
>> . On the other hand, trying to deploy on Android is relevant as a test of
>> embeddability and I think you could try that on Android.
>> You'd immediately have the problem that you have to invoke a special
>> Android method (getCacheDir()) to know where you can put temporary files
>> (System.getProperty("java.io.tmpdir") won't work)
>>
>> Now, we don't know how many GSoC slots we have, or whether both (or
>> neither) of the projects will get a GSoC slot. Let's see what happens and
>> then decide.
>>
>>
> Well, a GSoC application has to include a work plan, detailing what I
> would be working on each week. This is why I mentioned it; it is just an
> idea. In fact, I don't have any previous experience in Android development,
> as I told you, so you probably have better-prepared candidates for this
> task. What happens is that it is impossible for me to elaborate a work plan
> if we haven't really established what to do (and I only have two weeks for
> it...). My points were meant to be just some general ideas regarding this...
>

Try to make a draft of the work plan and let's base discussions on that.



>
> You could go a step deeper into what happens in ApertiumMain.main(euEnArgs)
>> and see if you could get the resources loaded directly from the JAR file.
>>
>
> Do you mean without extracting them to a temporary directory? I am afraid
> that this is not possible by just modifying the classes in the
> org.apertium.pipeline package. Embedded resources cannot be loaded as
> Files, so deeper changes would be needed. That is, each program
> (ApertiumInterchunk, LTProc and so on) that is called through the
> dispatcher takes the path of the resources that it needs in its arguments,
> and loads them itself. So it would be necessary to adapt all of
> them to access the embedded resources, and not only ApertiumMain (in fact,
> ApertiumMain only deals with the .mode file, but it doesn't directly do
> anything with the rest).
>

Agreed. It's a bigger task that requires modification of the lttoolbox-java
code.
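
One possible direction, as a rough sketch only: since TransExe.read
already takes an InputStream, each program could ask a small helper for a
stream instead of opening a File. The helper below (ResourceOpener is a
hypothetical name, not an existing class) tries the file system first and
then falls back to the classpath, which is where resources packed in the
JAR would be found.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: opens a resource from disk or, failing that, the JAR. */
public class ResourceOpener {

    /**
     * Returns a stream for the given name. A regular file with that path
     * wins; otherwise the name is looked up on the classpath.
     */
    public static InputStream open(String name) throws IOException {
        Path path = Paths.get(name);
        if (Files.isRegularFile(path)) {
            return Files.newInputStream(path);
        }
        // Classpath names are relative to the root and use '/' separators.
        InputStream in = ResourceOpener.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException(name + " not found on disk or classpath");
        }
        return in;
    }
}
```

Existing command lines that pass real paths would keep working, while an
embedded JAR could pass resource names, so no temporary directory would be
needed for those programs.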



>
>
>
>> Try if you can avoid modifying lttoolbox-java. In that case you can just
>> show your own files.
>>
>
> I am sorry but I don't understand you. What do you mean by "just show
> your own files"? Do I have to print the content of the embedded files on
> screen or what?
>

Sorry. That sentence was unclear. I just meant that, to the extent
possible, if you work on new stuff, try to put the new stuff in new
classes. It makes it easier to see what was added (but I can always do a
diff).


Jacob

-- 
Jacob Nordfalk <https://plus.google.com/114820443085046080944>
http://javabog.dk
Android developer and teacher at
IHK <http://cv.ihk.dk/diplomuddannelser/itd/vf/MAU> and
Lund&Bendsen <https://www.lundogbendsen.dk/undervisning/beskrivelse/LB1809/>
