Re: ApacheCon Meetup

Matt Post Fri, 13 May 2016 09:17:15 -0700

This all looks really great; thanks for sending that list, Kellen.

Another small, nice-to-have but low-priority issue is the config file handling. 
I've seen some use of args4j (whose license seems permissive enough, though 
IANAL: https://github.com/kohsuke/args4j/blob/master/LICENSE), but can anyone 
comment on how this compares to Apache's Commons CLI?


A Joshua API would be great. We have a good start on this; with a little more 
refactoring in the hypergraph code, this could be tied up pretty cleanly.

My biggest non-research interest lately is putting together more language packs.

Kellen: have you guys tried comparing KenLM to BerkeleyLM? I ran some brief 
experiments a while back, and it came out roughly even, presumably because of 
the JNI overhead. BerkeleyLM lags KenLM in a few ways:

- Language model construction is not nearly as nice or efficient as KenLM's 
"lmplz"; also it doesn't do proper Kneser-Ney smoothing

- BerkeleyLM doesn't do state collapsing the way KenLM does, so it misses a key 
search efficiency. However, this contribution is fairly minimal, I think. It 
could also be added to the LanguageModel code without too much trouble.
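
To make the state collapsing point concrete, here is a minimal Java sketch of the 
general idea: after scoring, keep only the longest suffix of the history that 
some n-gram in the model can still extend, so hypotheses that differ only in 
irrelevant history can be recombined. This is not KenLM's or Joshua's actual 
code. The class name StateCollapser, the collapse method, and the set of 
extendable contexts are all hypothetical, and a real model would use a trie 
lookup rather than joined strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of KenLM, BerkeleyLM, or Joshua.
final class StateCollapser {
    // Word sequences (space-joined) that appear as the context of some n-gram.
    private final Set<String> extendableContexts;
    // Order of the language model (3 for a trigram model).
    private final int order;

    StateCollapser(Set<String> extendableContexts, int order) {
        this.extendableContexts = extendableContexts;
        this.order = order;
    }

    // Returns the longest suffix of the history (at most order - 1 words)
    // that the model can still extend. Shorter states mean more hypotheses
    // share a state and can be recombined during search.
    List<String> collapse(List<String> history) {
        int maxLen = Math.min(order - 1, history.size());
        for (int len = maxLen; len > 0; len--) {
            List<String> suffix = history.subList(history.size() - len, history.size());
            if (extendableContexts.contains(String.join(" ", suffix))) {
                return new ArrayList<>(suffix);
            }
        }
        // No suffix matters for future scores: the state is empty.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Toy set of contexts, assumed for this example.
        Set<String> contexts = new HashSet<>(Arrays.asList("the", "cat", "the cat"));
        StateCollapser collapser = new StateCollapser(contexts, 3);
        System.out.println(collapser.collapse(Arrays.asList("saw", "the", "cat"))); // [the, cat]
        System.out.println(collapser.collapse(Arrays.asList("on", "a", "cat")));    // [cat]
        System.out.println(collapser.collapse(Arrays.asList("saw", "a", "dog")));   // []
    }
}
```

In this toy setup, hypotheses ending in "a cat" and "my cat" would both collapse 
to the state [cat] and could be recombined, whereas a fixed two-word state would 
keep them apart.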

matt


> On May 13, 2016, at 12:10 PM, Tommaso Teofili <[email protected]> 
> wrote:
> 
> sorry guys, too late for me today ... have fun, it'd be good if you could
> send a wrap up on the list.
> 
> Regards,
> Tommaso
> 
> On Fri, 13 May 2016 at 18:08, Henry Saputra <
> [email protected]> wrote:
> 
>> Cool, sounds good =)
>> 
>> - Henry
>> 
>> On Thu, May 12, 2016 at 6:08 PM, kellen sunderland <
>> [email protected]> wrote:
>> 
>>> I just wanted to discuss it as a group.  Your approach looks good to me.
>>> 
>>> On Thu, May 12, 2016 at 6:05 PM, Henry Saputra <[email protected]>
>>> wrote:
>>> 
>>>> Ah sorry, trigger happy
>>>> 
>>>> About logging. Are you proposing to use log4j interface in the code? I
>>>> would recommend to use slf4j [1] as facade abstraction.
>>>> Then implementation could be done via log4j or logback.
>>>> 
>>>> Love to see API access to Joshua.
>>>> 
>>>> - Henry
>>>> 
>>>> [1] http://www.slf4j.org
>>>> 
>>>> On Thu, May 12, 2016 at 6:03 PM, Henry Saputra <[email protected]>
>>>> wrote:
>>>> 
>>>>> About logging. Are you proposing to use log4j interface in the code? I
>>>>> would recommend to use slf4j [1]
>>>>> 
>>>>> 
>>>>> [
>>>>> 
>>>>> On Thu, May 12, 2016 at 2:30 PM, kellen sunderland <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Thanks for organizing Lewis,
>>>>>> 
>>>>>> Here's some topics for discussion I've been noting while working with
>>>>>> Joshua.  None of these are high priority issues for me, but if we are all
>>>>>> in agreement on them it might make sense to log them.
>>>>>> 
>>>>>> Boring code convention stuff: Logging with log4j, throw Runtime Exceptions
>>>>>> instead of Typed, remove all system exits (replace with RuntimeExceptions),
>>>>>> refactor some large files.
>>>>>> 
>>>>>> Testing: Integrate existing unit tests, provide some good test examples so
>>>>>> others can begin adding more tests.
>>>>>> 
>>>>>> Configuration: We also touched on IoC, CLI args, and configuration changes
>>>>>> that are possible.
>>>>>> 
>>>>>> OO stuff: Joshua is pretty good here, but I would personally prefer more
>>>>>> granular interfaces.  I wouldn't advocate radical changes, but maybe a
>>>>>> little refactoring might make sense to better align with the interface
>>>>>> segregation principle.
>>>>>> https://en.wikipedia.org/wiki/Interface_segregation_principle
>>>>>> 
>>>>>> JNI reliance:  We've found KenLM works really well with Joshua, but there
>>>>>> is one issue with using it.  It requires many JNI calls during decoding
>>>>>> and these calls impact GC performance.  In fact when a JNI call happens the
>>>>>> GC throws out any work it may have done and quits until the JNI call
>>>>>> completes.  The GC will then resume and start marking objects for
>>>>>> collection from scratch.  This is not ideal especially for programs with
>>>>>> large heaps (Joshua / Spark).  There's a couple ways we could mitigate
>>>>>> this and I think they'd all speed up Joshua quite a lot.
>>>>>> 
>>>>>> High level roadmap topics:
>>>>>> 
>>>>>> *  Distributed Decoding is something I'll likely continue working on.
>>>>>> There's some obvious things we can do given usage patterns of translation
>>>>>> engines that can help us out here (I think).
>>>>>> *  Providing a way to optimize Joshua for low-latency, low-throughput calls
>>>>>> could be interesting for those with near real-time use cases.  Providing a
>>>>>> way to optimize for high-latency, high-throughput could be interesting for
>>>>>> async/batch use cases.
>>>>>> *  The machine learning optimization algorithms could be cleaned up a bit
>>>>>> (MERT/MIRA).
>>>>>> *  The Vocabulary could probably be replaced with a simpler implementation
>>>>>> (without sacrificing performance).
>>>>>> 
>>>>>> -Kellen
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 12, 2016 at 12:32 PM, Lewis John Mcgibbney <
>>>>>> [email protected]> wrote:
>>>>>> 
>>>>>>> Hi Folks,
>>>>>>> Kellen, Henri and I are going to get together tomorrow 13th around
>>>>>>> lunchtime PST to talk everything Joshua.
>>>>>>> Would be great to have others online via GChat if possible.
>>>>>>> Let's say around 11am PST for the time being.
>>>>>>> See you then folks.
>>>>>>> Thanks
>>>>>>> Lewis
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>
