Re: incubator wiki

2016-06-07 Thread Matt Post
Hi Chris,

You already had full permissions to the whole space. I had locked down one page 
(is that what you were talking about?) and just added you to that.

matt


> On Jun 7, 2016, at 1:22 AM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> hey Matt can you grant perms to chrismattmann (username)
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 6/6/16, 6:03 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
> 
>> Hi everyone,
>> 
>> I made the confluence page public (read-only), as part of transitioning the 
>> website there. It didn't seem to me that anything there was private, but if 
>> something should be, we can lock down individual pages to members only.
>> 
>> (Does anyone know how to have a Confluence group created?)
>> 
>> matt



Re: [jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-25 Thread Matt Post
Having written that, factoring the pipeline would be a good first step to 
replacing the guts of the pipeline. It's worth noting that many of these are 
already done:

- alignment is handled by $JOSHUA/scripts/training/paralign.pl
- tuning is handled by $JOSHUA/scripts/training/run_tuner.py
- there is a script for running Thrax ($JOSHUA/scripts/training/run_thrax.py), 
but it is not pulled into the decoder yet

However, Lewis' basic point stands: the pipeline is a mess, and it would be 
good to have good interfaces to each of the subtasks, as an intermediate step 
to replacing the logic of the pipeline with a more versatile (and readable) 
tool like ducttape.

matt


> On May 24, 2016, at 7:27 PM, Matt Post (JIRA) <j...@apache.org> wrote:
> 
> 
>   [ 
> https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299141#comment-15299141
>  ] 
> 
> Matt Post commented on JOSHUA-270:
> --
> 
> The pipeline is a huge mess, probably not worth salvaging. I'm hoping (maybe 
> this year?) to rewrite it, perhaps using this: 
> https://github.com/jhclark/ducttape/
> 
>> pipeline.pl needs major refactoring
>> ---
>> 
>>   Key: JOSHUA-270
>>   URL: https://issues.apache.org/jira/browse/JOSHUA-270
>>   Project: Joshua
>>Issue Type: Bug
>>Components: pipeline
>>  Affects Versions: 6.0.5
>>  Reporter: Lewis John McGibbney
>>   Fix For: 6.1
>> 
>> 
>> Right now 
>> [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl]
>>  is well over 2000 lines long and extremely difficult to navigate. 
>> I propose the following
>> * All ENV is refactored into a pipeline_environment file
>> * All command-line parsing and definitions are refactored into a 
>> pipeline_cli file
>> * Sanity checking is refactored into a pipeline_sanity_check file
>> * Dependent variable checking is refactored into a 
>> pipeline_dependent_variable_setting file
>> * Filtering and preprocessing of corpora is refactored into 
>> pipeline_filter_preprocess_corpora
>> * pipeline_subsampling becomes a file
>> * pipeline_alignment becomes a file
>> * pipeline_parsing becomes a file
>> * pipeline_thrax becomes a file
>> * pipeline_tuning becomes a file
>> * pipeline_testing becomes a file
>> * pipeline_subroutines becomes a file
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)



joshua API changes

2016-05-25 Thread Matt Post
Hi folks (especially Felix, Kellen, Tobi) — 

I made two moderate improvements to Joshua on the way home. The first was to 
get rid of all the specialized phrase handling; the packer now works as we 
discussed, packing everything into Hiero format, and the stack-based decoder 
uses this directly now. Everything should be backwards compatible for hiero, 
but it's not for phrase-based. I added a "version = 3" line to the packer 
config to distinguish this, along with a check, so the decoder will throw a 
runtime exception if you try to load something incompatible. If anything fails, 
instead of repacking your grammar, just add the line "version = 3" to the 
packer config. The changes only affect packing for phrase-based models, so I 
don't think it will matter to you. This is pushed up into master.
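
A minimal sketch of the kind of check described above, assuming the packer 
config is a list of "key = value" lines. The class and method names are 
hypothetical and are not Joshua's actual config reader; only the 
"version = 3" line comes from this message.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: NOT Joshua's actual packer-config reader.
final class PackerVersionCheck {
    // The version number mentioned in the message above.
    static final int SUPPORTED_VERSION = 3;

    /** Returns the value of the "version = N" line, or -1 if there is none. */
    static int parseVersion(List<String> configLines) {
        for (String line : configLines) {
            String[] parts = line.split("=", 2);
            if (parts.length == 2 && parts[0].trim().equals("version")) {
                return Integer.parseInt(parts[1].trim());
            }
        }
        return -1;
    }

    /** Throws a RuntimeException when the packed grammar looks incompatible. */
    static void requireCompatible(List<String> configLines) {
        int version = parseVersion(configLines);
        if (version != SUPPORTED_VERSION) {
            throw new RuntimeException("Packed grammar version " + version
                + " is incompatible; expected " + SUPPORTED_VERSION);
        }
    }

    public static void main(String[] args) {
        // "some_key" is a placeholder, not a real packer option.
        requireCompatible(Arrays.asList("some_key = some_value", "version = 3"));
        System.out.println("config accepted");
    }
}
```

A config without the version line fails the check, which matches the advice 
to add the line by hand rather than repack.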

The bigger one is on a JOSHUA-273 branch. I just pushed up a refactoring of the 
KBestExtraction / structured translation interface, per our discussions this 
week. However, I wasn't actually sure how to use the API. What is the entry 
point? Are you calling translate() directly and managing your own thread pools? 
It doesn't seem like you would be using Decoder.decode() or decodeAll(), since 
they're not very API-ish.

If you want to take a look at the changes, I'd welcome feedback, direct 
changes, etc. Here is a description of the major changes:

- Large refactor of the Translation output interface

- Instead of returning Translation objects, the calls to Decoder.translate() 
now return HyperGraph objects. As before, a HyperGraph represents the complete 
(pruned) search space the decoder explored. A HyperGraph can then be operated 
on by KBestExtractors and by the new TranslationFactory object, so that it can 
be thrown away.

- KBestExtractor is now an iterator that takes a HyperGraph object and returns 
DerivationState objects, each representing a single derivation tree

- Translation and StructuredTranslation are now combined. Translation is 
effectively a dummy object with a number of fields of interest that get 
populated by TranslationFactory, per explicit requests. Each request returns 
the TranslationFactory object, so you can easily chain calls, and then retrieve 
the Translation object at the end. e.g.,

KBestExtractor extractor = new KBestExtractor(hg, ...);
for (DerivationState derivation : extractor) {
  // Each request returns the factory, so the calls can be chained.
  TranslationFactory factory = new TranslationFactory(derivation, ...);
  Translation translation = factory.alignments()
      .formattedTranslation(config.outputFormat)
      .features()
      .translation();
}

- Neither KBestExtractors nor Translation objects do any printing. This 
improved encapsulation is a big improvement over the past. After building your 
Translation objects, they will contain only small objects such as strings, 
feature vectors, and alignments, that can be safely passed downstream while the 
HyperGraph gets destroyed. Also, code for processing and formatting is all now 
in one place, the TranslationFactory.

- Also, I removed the forest rescoring and OracleExtraction classes. These are 
useful but not used, and are hard to read and should therefore be rewritten. I 
will do this at some point.
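
The chaining idea in the list above can be sketched in a self-contained way. 
Everything below is a hypothetical stand-in written for illustration: the 
classes are toy versions and do not reproduce the real signatures on the 
JOSHUA-273 branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, NOT the Joshua classes on the JOSHUA-273 branch.
final class FluentFactorySketch {

    /** Plain value object: small fields only, safe to pass downstream. */
    static final class Translation {
        String text;
        List<String> alignments;
    }

    /** Toy stand-in for one derivation; here it is just a word list. */
    static final class Derivation {
        final List<String> words;
        Derivation(List<String> words) { this.words = words; }
    }

    /** Each request fills one field and returns the factory, so calls chain. */
    static final class TranslationFactory {
        private final Derivation derivation;
        private final Translation result = new Translation();

        TranslationFactory(Derivation derivation) { this.derivation = derivation; }

        TranslationFactory formattedTranslation() {
            result.text = String.join(" ", derivation.words);
            return this;
        }

        TranslationFactory alignments() {
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < derivation.words.size(); i++) {
                pairs.add(i + "-" + i); // toy monotone alignment
            }
            result.alignments = pairs;
            return this;
        }

        /** Ends the chain and hands back the populated value object. */
        Translation translation() { return result; }
    }

    public static void main(String[] args) {
        Derivation d = new Derivation(Arrays.asList("hello", "world"));
        Translation t = new TranslationFactory(d)
            .alignments()
            .formattedTranslation()
            .translation();
        System.out.println(t.text + " ||| " + t.alignments);
    }
}
```

Only the requested fields are populated, so a caller that skips a request 
gets a null field rather than paying for work it did not ask for.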

There are still a few things broken on the branch, but they are small and I am 
working to fix them. If you have a minute to poke around on the branch, please 
do, so that the end result is what you imagined when we were chatting the other 
day.

matt

too many emails

2016-05-25 Thread Matt Post
Does someone know how to turn off the mailing of all github comments to dev?

The way I see it, we all have to be on dev, so it should be for people, not 
robots. I am getting every comment about three times.

I would just do it but I don't know how.

incubator wiki

2016-06-06 Thread Matt Post
Hi everyone,

I made the confluence page public (read-only), as part of transitioning the 
website there. It didn't seem to me that anything there was private, but if 
something should be, we can lock down individual pages to members only.

(Does anyone know how to have a Confluence group created?)

matt

Re: Wiki access

2016-05-26 Thread Matt Post
Hi Tom — This is a dumb question, but where is the Joshua wiki? You're not 
talking about the confluence page, are you? I see you have access there.

https://cwiki.apache.org/confluence/display/JOSHUA/Joshua+%28Incubating%29+Home

matt




> On May 25, 2016, at 5:20 PM, Tom Barber  wrote:
> 
> Hello
> 
> Can someone give me(bugg_tb) access to the Joshua wiki please.
> 
> Ta
> 
> Tom
> --
> 
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
> 
> (Thanks to the Saiku community we reached our Kickstart
> 
> goal, but you can always help by sponsoring the project
> )



Re: [GitHub] incubator-joshua pull request: JOSHUA-252 Make it possible to use ...

2016-05-26 Thread Matt Post
yeah this is really strange. I'm talking about the regression tests, not the 
unit tests. these are in src/test/resources. run for example 
test/bn-en/hiero/test.sh. 3 seconds on master, 18 on JOSHUA-252 (you might have 
to remove "-threads 2")

matt (from my phone)

> On May 26, 2016, at 9:15 PM, lewismc  wrote:
> 
> Github user lewismc commented on the pull request:
> 
>https://github.com/apache/incubator-joshua/pull/12#issuecomment-222037050
> 
>What kind of 'tests' are we talking about here? The only tests which I 
> know of are invoked by running ```mvn clean test```. These execute very 
> quickly by the looks of it however I assume that they cover little or none of 
> the multithreaded functionality you are referring to. 
>Do you have something we could code, profile and see what is taking so 
> long @mjpost  ? Thanks
> 
>One last thing. No I changed absolutely no code. I did however introduce 
> some classes such as the ArpaFile, etc which were required to get the current 
> tests running. 
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---



junit in eclipse

2016-06-01 Thread Matt Post
Has anyone successfully run JUnit tests in Eclipse? I'd love to integrate them 
but am not sure how to go about setting it up. I thought I'd ask before burning 
the time on Google. I'll volunteer to write a wiki article if you can help me 
out :)

matt

Re: [jira] [Commented] (JOSHUA-264) Remove system exits and replace with RuntimeExceptions

2016-06-14 Thread Matt Post
Go ahead :)


> On Jun 14, 2016, at 2:10 PM, Thamme Gowda (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330600#comment-15330600
>  ] 
> 
> Thamme Gowda commented on JOSHUA-264:
> -
> 
> Yes, this task is completed and we can close this issue
> 
>> Remove system exits and replace with RuntimeExceptions
>> --
>> 
>>Key: JOSHUA-264
>>URL: https://issues.apache.org/jira/browse/JOSHUA-264
>>Project: Joshua
>> Issue Type: Improvement
>>   Reporter: Kellen Sunderland
>> 
>> When Joshua is used as a library it's much more convenient to get 
>> RuntimeExceptions when a fatal error happens.  This way the host process can 
>> possibly handle the error or take some appropriate action (alarm, log, etc).
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
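
A minimal sketch of the pattern the issue describes, with illustrative names 
that are assumptions and not taken from the Joshua code base: the library 
throws instead of calling System.exit(), and the host decides how to react.

```java
// Hypothetical sketch; class and method names are illustrative, not Joshua's.
final class FatalErrorSketch {

    /** Library code: signal a fatal condition by throwing, not by exiting. */
    static String loadModel(String path) {
        if (path == null || path.isEmpty()) {
            throw new RuntimeException("No model path given");
        }
        return "model@" + path;
    }

    /** Host code: the JVM is still running, so it can log and fall back. */
    static String loadOrFallback(String path, String fallback) {
        try {
            return loadModel(path);
        } catch (RuntimeException e) {
            System.err.println("Model load failed: " + e.getMessage());
            return loadModel(fallback);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadOrFallback("", "default.model"));
    }
}
```

With System.exit() in loadModel, the fallback branch could never run, because 
the whole host process would terminate first.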



Fwd: Build failed in Jenkins: joshua_master #71

2016-06-23 Thread Matt Post
This is going to be a problem — having a portion of the test suite depending on 
KenLM, which is not bundled, distributable, or platform-independent...
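
One possible way to soften this, sketched under assumptions: probe for the 
native library before running the tests that need it, and skip them when it 
is missing. The class below is hypothetical; the library name "ken" is 
inferred from the libken.so mentioned in the forwarded log, and this is not 
how the Joshua test suite currently behaves.

```java
// Hypothetical sketch: a guard for tests that need a native library.
final class NativeLibraryGuard {

    /** Returns true if the named native library can be loaded by this JVM. */
    static boolean isLoadable(String libraryName) {
        try {
            // Resolves to lib<name>.so on Linux and lib<name>.dylib on OS X.
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "ken" is an assumption based on the libken.so name in the log.
        if (!isLoadable("ken")) {
            System.out.println("KenLM library not found; skipping KenLM tests");
        }
    }
}
```

The boolean could then be handed to the test framework's skip mechanism (for 
example JUnit 4's Assume.assumeTrue), so a missing library is reported as a 
skipped test and not as a failed build.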


> Begin forwarded message:
> 
> From: Apache Jenkins Server 
> Subject: Build failed in Jenkins: joshua_master #71
> Date: June 23, 2016 at 10:15:08 AM EDT
> To: dev@joshua.incubator.apache.org, p...@jgeppert.com, antoni...@riseup.net, 
> da...@davidkarlsen.com, fhie...@amazon.com
> Reply-To: dev@joshua.incubator.apache.org
> 
> See 
> 
> Changes:
> 
> [david] Migrated the Dockerfile to use new Maven build
> 
> --
> Started by an SCM change
> [EnvInject] - Loading node environment variables.
> Building remotely on ubuntu-5 (docker Ubuntu ubuntu5 ubuntu yahoo-not-h2) in 
> workspace 
>> git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>> git config remote.origin.url 
>> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git # timeout=10
> Fetching upstream changes from 
> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git
>> git --version # timeout=10
>> git -c core.askpass=true fetch --tags --progress 
>> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git 
>> +refs/heads/*:refs/remotes/origin/*
>> git rev-parse refs/remotes/origin/master^{commit} # timeout=10
>> git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
> Checking out Revision 6aef63d7893f776f2d23e0f797ed7dc1123c756a 
> (refs/remotes/origin/master)
>> git config core.sparsecheckout # timeout=10
>> git checkout -f 6aef63d7893f776f2d23e0f797ed7dc1123c756a
>> git rev-list f3a511836d139f270ce2ffaeefd39090b3cbb826 # timeout=10
> [joshua_master] $ /home/jenkins/tools/maven/apache-maven-3.0.4/bin/mvn clean 
> deploy javadoc:aggregate
> [INFO] Scanning for projects...
> [INFO]
>  
> [INFO] 
> 
> [INFO] Building Apache Joshua Machine Translation Toolkit 6.0.6-SNAPSHOT
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ joshua ---
> [INFO] Deleting 
> [INFO] 
> [INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ joshua ---
> [INFO] 
> [INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ joshua 
> ---
> [debug] execute contextualize
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 1 resource
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ joshua ---
> [INFO] Compiling 268 source files to 
> 
> [INFO] 
> [INFO] --- maven-resources-plugin:2.5:testResources (default-testResources) @ 
> joshua ---
> [debug] execute contextualize
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 321 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ 
> joshua ---
> [INFO] Compiling 44 source files to 
> 
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ joshua ---
> 
> ---
> T E S T S
> ---
> Running TestSuite
> ERROR - * FATAL: Can't find libken.so (libken.dylib on OS X) in $JOSHUA/lib
> ERROR - *This probably means that the KenLM library didn't compile.
> ERROR - *Make sure that BOOST_ROOT is set to the root of your boost
> ERROR - *installation (it's not /opt/local/, the default), change to
> ERROR - *$JOSHUA, and type 'ant kenlm'. If problems persist, see the
> ERROR - *website (joshua-decoder.org).
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> %
> %
> %
> %
> %
> %
> %
> %
> %
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> Tests run: 131, Failures: 1, Errors: 0, Skipped: 6, Time elapsed: 2.242 sec 
> <<< FAILURE! - in TestSuite
> setUp(org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest)  
> Time elapsed: 0.517 sec  <<< FAILURE!
> java.lang.ExceptionInInitializerError
>   at 
> 

Re: [IMPORTANT] Roadmap for 6.1 Release

2016-06-23 Thread Matt Post
Hi Lewis,

Sorry for taking some time to get back to you. I think the roadmap looks great. 
One thing, though, is that the Amazon folks and I have discussed making a 
number of backwards-incompatible changes in an effort to modernize some pieces 
of the code. This would have to do with things like the config file format, a 
totally new pipeline based on ducttape, and some other ideas. We think those 
changes would be suitable for a 7.0 release (a major version number change 
signals backwards incompatibility).

I think we've been doing some good work on improving Joshua, but at the same 
time, I think the release cycle is still a little too accelerated for me. I 
would like to push back to semi-yearly or even yearly releases, with bug fixes 
in between. However, I'm also curious how this might affect our ability to 
move out of incubation. Do you have any thoughts on this?

The major downside to releases is documentation. It's just hard to find the 
time to do it.

My own thoughts for what I'd like to do:

- Maybe a 6.1 release (soon, to get it out of the way? or otherwise this 
fall?), where we formalize the Apache move and maybe formalize the release of a 
handful of language packs, without a lot of other changes

- Write a linux.com article advertising this, hopefully attracting some 
attention

- Shoot for a 7.0 release with many of the changes we've discussed (some 
offline). If we get a good showing at MT Marathon in Prague this year, that 
could be a good time to get all of that in order.

- Start getting to work on a version of Joshua that swaps out the core decoder 
for a neural approach

matt




> On Jun 23, 2016, at 4:13 PM, Tom Barber  wrote:
> 
> I would volunteer some cycles for multi model support in the server and an
> improved rest interface and basic UI for end user interaction if you fancy
> it.
> 
> --
> 
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
> 
> (Thanks to the Saiku community we reached our Kickstart
> 
> goal, but you can always help by sponsoring the project
> )
> 
> On 23 June 2016 at 21:10, Lewis John Mcgibbney 
> wrote:
> 
>> Hi Folks,
>> Anyone have any comments on this?
>> Seeing that the Maven multimodule project seems to be taking flight, it
>> would be nice to see where the roadmap is going?
>> Any comments would be great. Also, I'm kinda lost as to what is happening
>> with Jira but it looks like it is not really being used for much.
>> Thanks
>> 
>> On Mon, Jun 20, 2016 at 11:34 AM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>> 
>>> Hi Folks,
>>> I've just smartened up Jira a bit with our Roadmap being defined as
>> follows
>>> 
>>> 
>>> 
>> https://issues.apache.org/jira/browse/joshua/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
>>> 
>>> Right now there are only 14/14 issues as RESOLVED for 6.1. This is false
>>> as I know that many more issues have been addressed however I don't think
>>> that Jira tickets have been created for all changes to the source code.
>>> Maybe moving forward we could open Jira issues and link them to the
>> Github
>>> tickets via commit messages?
>>> 
>>> Additionally, everything that was currently UNRESOLVED has merely been
>>> pushed to 6.2. If this is not what is required then please reassign the
>> fix
>>> version for any ticket(s) to 6.1 and we can fix.
>>> 
>>> Finally, are there any mitigating factors which would prevent a 6.1
>> release
>>> candidate being prepared right now?
>>> Thanks
>>> Lewis
>>> 
>>> --
>>> *Lewis*
>>> 
>> 
>> 
>> 
>> --
>> *Lewis*
>> 



Re: hosting release files

2016-04-05 Thread Matt Post
The Joshua release is just a hundred megabytes or so. If we exclude Hadoop and 
other tools used for building (which I think we should do), the release is more 
like tens of megabytes.

For language packs, I think a reasonable expectation is 2--3 gigabytes each.

matt


> On Apr 5, 2016, at 2:09 PM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Yep absolutely. When we *release* Apache Joshua (FYI here is a 
> guide to creating an Incubator release [1]), we can also release
> large tarballs as well. If the files are >1 GB we need to inform
> Apache infrastructure as the files are mirrored around the world.
> 
> What are the expected release sizes, and what is our overall 
> expectation for the contribution of each language pack to the 
> release?
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 4/5/16, 1:57 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
> 
>> Does Apache provide a place to host releases, language packs, and other 
>> (potentially large) files? Right now, they're all under my home directory at 
>> Hopkins, and it would be nice to put them in a more formal location (where 
>> I'm not pushing up against a quota).
>> 
>> matt



Re: Logo for Joshua

2016-04-05 Thread Matt Post
Yes, thanks, Lewis! By stickers, are you referring to physical stickers?

I like this a lot (in particular that Apache feather), but am thinking it might 
be worthwhile to hire a graphic designer to see if she / he could weave them 
together with a bit more finesse. Any objections to this, or is it too late?

matt



> On Mar 30, 2016, at 3:25 AM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> thanks Lewis! It looks good and yes, stickers please :)
> 
> Il giorno mer 30 mar 2016 alle ore 03:36 Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> ha scritto:
> 
>> https://issues.apache.org/jira/browse/JOSHUA-249
>> 
>> On Tue, Mar 29, 2016 at 6:33 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>> 
>>> Hi Folks,
>>> With the very basic merging skills I have in Gimp I whipped together a
>>> rudimentary feathered Joshua logo
>>> http://home.apache.org/~lewismc/apache_joshua_logo.png
>>> with the corresponding multi-layered file
>>> http://home.apache.org/~lewismc/apache_joshua_logo.xcf
>>> If you guys want to use this then by all means please do. We can also
>>> request a powered by sticker from press@
>>> Thanks
>>> 
>>> On Tue, Mar 29, 2016 at 9:40 AM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>> 
>>>> ACK
>>>> 
>>>> On Tue, Mar 29, 2016 at 9:20 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>>> Even better, there's a vectorized version:
>>>>> 
>>>>>http://joshua-decoder.org/images/joshua-logo.pdf
>>>>> 
>>>>> I'm not a graphic designer, but an easy approach would be to put the
>>>>> "powered-by" logo around the Joshua one.
>>>>> 
>>>>> matt
>>>>> 
>>>>> 
>>>>>> On Mar 29, 2016, at 11:56 AM, Lewis John Mcgibbney <
>>>>> lewis.mcgibb...@gmail.com> wrote:
>>>>>> 
>>>>>> http://www.apache.org/foundation/press/kit/
>>>>>> 
>>>>>> On Tue, Mar 29, 2016 at 8:47 AM, Mattmann, Chris A (3980) <
>>>>>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>>> would be great to brand it..
>>>>>>> 
>>>>>>> ++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Chief Architect
>>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>> Email: chris.a.mattm...@nasa.gov
>>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>>> ++
>>>>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> WWW: http://irds.usc.edu/
>>>>>>> ++
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -Original Message-
>>>>>>> From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
>>>>>>> Reply-To: "dev@joshua.incubator.apache.org"
>>>>>>> <dev@joshua.incubator.apache.org>
>>>>>>> Date: Tuesday, March 29, 2016 at 8:26 AM
>>>>>>> To: "dev@joshua.incubator.apache.org" <
>>>>> dev@joshua.incubator.apache.org>
>>>>>>> Subject: Logo for Joshua
>>>>>>> 
>>>>>>>> Hi Folks,
>>>>>>>> A bit of fun now...
>>>>>>>> The current logo for Joshua can be found at [0], I actually quite
>>>>> like the
>>>>>>>> color.
>>>>>>>> Does anyone want to take on the task of branding it? Or do you want
>>>>> to
>>>>>>>> leave it as it is?
>>>>>>>> Ta
>>>>>>>> 
>>>>>>>> [0] http://joshua-decoder.org/images/joshua-logo-small.png
>>>>>>>> 
>>>>>>>> --
>>>>>>>> *Lewis*
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> *Lewis*
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Lewis*
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> *Lewis*
>>> 
>> 
>> 
>> 
>> --
>> *Lewis*
>> 



Re: Release Cycle for Joshua

2016-04-12 Thread Matt Post
May 1 is going to be hard for me. I'd like to advocate for quarterly releases 
with a June 1 first release. Would that be okay?


> On Apr 12, 2016, at 1:31 AM, Lewis John Mcgibbney  
> wrote:
> 
> Cool.
> I've got the 1st Incubating release provisionally down for 1st May. It
> would be dynamite if we could get an RC ready for then.
> 
> 
> On Mon, Apr 11, 2016 at 10:13 PM, Tommaso Teofili > wrote:
> 
>> +1 for a (not too strict) scheduled cycle, 2 months pace sounds appropriate
>> for starting.
>> 
>> Regards,
>> Tommaso
>> 
>> Il giorno lun 11 apr 2016 alle ore 19:51 Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> ha scritto:
>> 
>>> Hi Folks,
>>> Anyone interested in setting a release cycle for Joshua e.g. every two
>> months,
>>> quarterly?
>>> Would be nice to get consensus on this and even maybe make our first
>>> release prior to ApacheCon coming up.
>>> Thanks
>>> Lewis
>>> 
>>> --
>>> *Lewis*
>>> 
>> 
> 
> 
> 
> -- 
> *Lewis*



Re: ApacheCon 2016 and Joshua

2016-03-19 Thread Matt Post
I'm going to look into making a short trip. I think I'd arrive on the 11th and 
then leave on Friday the 13th. Could we plan a meet up for the night of 
Thursday the 12th? It'd be great to meet everyone (and having a deadline would 
help me prioritize :) )

matt


> On Mar 15, 2016, at 12:24 PM, Lewis John Mcgibbney 
> <lewis.mcgibb...@gmail.com> wrote:
> 
> Hi Matt,
> 
> On Mon, Mar 14, 2016 at 8:26 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Whoa! Lewis, can you give some more detail on this talk, what you
>> proposed, and what you plan to talk about?
>> 
> 
> http://sched.co/6OJI
> 
> 
>> 
>> I haven't ever been to ApacheCon, but am interested in going. I don't have
>> much of a feel for what motivates folks outside the academic research
>> community, and that would be good to have in laying out projects that might
>> interest people.
>> 
> 
> I agree. Would be great to meet you there. We could have a Joshua meetup.
> 
> 
>> 
>> Regarding those projects, I have a number of them. Perhaps it would be
>> useful to flesh them out with some more detail, and perhaps post them, for
>> those who are interested. First, with respect to Tommaso's question, the
>> following:
>> 
>> - Use cases. I'd really like to push machine translation as a black box,
>> where people can download and use models, not caring how they work, and
>> building on top of them. I think this could be transformative. I've just
>> added to Joshua the ability to add, store, and manage custom phrasal
>> translation rules, which would let people take a model and add their own
>> translations on top of it, perhaps correcting mistakes as they encounter
>> them. There's a JSON API for it (undocumented).
>> 
>> Building this up would also require pulling together lots of different
>> test sets, evaluating changes, and so on.
>> 
>> - Neural nets. This is a huge research area. I think the advantages are
>> that it could enable releasing models that are much smaller. However, on
>> the down side, it's not clear what the best way to integrate these models
>> into Joshua is. Fully neural attention models would require re-architecting
>> Joshua, as they are essentially a new paradigm. Adding neural components as
>> feature functions that interact with the existing decoding algorithm would
>> be an intermediate step.
>> 
> 
> OK. This sounds like bang on for a meet up topic. Regardless of who is
> there, we could have a Webex or something similar for the incubating
> community,
> 
> 
>> 
>> For other projects, I'd love:
>> 
>> - Better documentation, developer and end-user (probably I need to write a
>> lot of this; if nothing else, it would be hugely useful to me in terms of
>> prioritizing to know that people want it)
> 
> 
>> - Rewriting certain components. The tuning modules, in particular, are a
>> real mess, and should be synthesized and improved.
>> 
>> - Replacing Moses components. Joshua can call out to Moses to build phrase
>> tables; it would be nice to get rid of this (and wouldn't be that hard)
>> with our own Java implementations. It would also be good to add a
>> lexicalized distortion model to the phrase-based decoder.
>> 
>> 
> These all sound excellent and would all make very reasonable GSoC projects,
> Thanks
> Lewis



Re: Migrating Community from Github and GoggleGroups to Apache

2016-03-24 Thread Matt Post
(offline yesterday and today, will do tomorrow)

matt (from my phone)

> On Mar 24, 2016, at 1:48 PM, Lewis John Mcgibbney  
> wrote:
> 
> Hi Matt,
> As the primary figure within the Joshua community I wonder if you can act
> on the following. It will go a long way in enabling us to transition things
> over.
> 
>   1. Can you create a new branch on the Github repos at [0] called
>   'apache' with only a README which states that the Joshua project has been
>   transitioned to the Apache Incubator. A link to the Github repos at [1]
>   (I've opened [2] to get this set up) would be great. Also a link to the new
>   website (once we get it up and running) at [3].
>   2. Post a message to the existing Joshua Google Groups which states the
>   same as the above but references the new mailing lists at
>   u...@joshua.incubator.apache.org and dev@joshua.incubator.apache.org
>   3. Post to the Joshua Website that the project has been migrated over to
>   the Apache Software Foundation and provide relevant links as posted above.
> 
> If we address the above it would be ideal.
> 
> Thanks Matt, please let me know if there are any clarifications required.
> 
> [0] https://github.com/joshua-decoder/joshua
> 
> [1] http://github.com/apache/incubator-joshua
> 
> [2] https://issues.apache.org/jira/browse/INFRA-11539
> 
> [3] http://joshua.incubator.apache.org
> 
> -- 
> *Lewis*



Re: Migrating Community from Github and GoggleGroups to Apache

2016-03-26 Thread Matt Post
Lewis,

What are the framework options? I was hoping that we could take advantage of 
the move to improve Joshua's website, which is really quite ugly. Is there any 
help available for this? Are there common Apache themes or something we could 
use? I currently have Jekyll underneath, so it wouldn't be that hard to swap 
out some templates. Jekyll is nice because you can write in Markdown and then 
convert, but I am open to other ideas if there's something newer and nicer. I 
would love to move to something wiki-like that permits web-based editing, 
instead of pushing and pulling with git.

We could just start by seeding the current website, but I am a little afraid 
that decreases the chances we'll do something new.

matt


> On Mar 25, 2016, at 3:22 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> 
> wrote:
> 
> The only concern I have is if the codebases now diverge. That is a very
> real possibility and I've seen it happen time and time again.
> 
> We are essentially waiting on absolutely nothing to get the website
> running, all we are waiting on is deciding on a framework, etc.
> 
> If you guys want I can just clone and launch the existing website over on
> joshua.incubator.apache.org (so that something is there), we could then do
> the other tasks over the weekend.
> Any thoughts on that?
> 
> On Friday, March 25, 2016, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Lewis -- regarding your requests 1--3, I've been waiting until we had
>> versions of everything migrated before announcing the move (you asked me a
>> long time ago to make announcements). What do we need to do before the new
>> website is live? It seems worthwhile to have that running before pointing
>> people at it.
>> 
>> matt
>> 
>> 
>>> On Mar 25, 2016, at 2:22 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>> 
>>> BOOM
>>> 
>>> https://github.com/apache/incubator-joshua
>>> 
>>> 
>>> 
>>> On Thu, Mar 24, 2016 at 10:55 AM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>> 
>>>> Thanks Matt, enjoy the time off (hopefully you are not ill)
>>>> Later
>>>> 
>>>> On Thu, Mar 24, 2016 at 10:54 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>>> (offline yesterday and today, will do tomorrow)
>>>>> 
>>>>> matt (from my phone)
>>>>> 
>>>>>> On Mar 24, 2016, at 1:48 PM, Lewis John Mcgibbney <
>>>>> lewis.mcgibb...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Matt,
>>>>>> As the primary figure within the Joshua community I wonder if you can
>>>>> act
>>>>>> on the following. It will go a long way in enabling us to transition
>>>>> things
>>>>>> over.
>>>>>> 
>>>>>> 1. Can you create a new branch on the Github repos at [0] called
>>>>>> 'apache' with only a README which states that the Joshua project has
>>>>> been
>>>>>> transitioned to the Apache Incubator. A link to the Github repos at
>>>>> [1]
>>>>>> (I've opened [2] to get this set up) would be great. Also a link to
>>>>> the new
>>>>>> website (once we get it up and running) at [3].
>>>>>> 2. Post a message to the existing Joshua Google Groups which states
>>>>> that
>>>>>> same as the above however references the new mailing lists at
>>>>>> u...@joshua.incubator.apache.org and dev@joshua.incubator.apache.org
>>>>>> 3. Post to the Joshua Website that the project has been migrated over
>>>>> to
>>>>>> the Apache Software Foundation and provide relevant links as posted
>>>>> above.
>>>>>> 
>>>>>> If we address the above it would be ideal.
>>>>>> 
>>>>>> Thanks Matt, please let me know if there are any clarifications
>>>>> required.
>>>>>> 
>>>>>> [0] https://github.com/joshua-decoder/joshua
>>>>>> 
>>>>>> [1] http://github.com/apache/incubator-joshua
>>>>>> 
>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-11539
>>>>>> 
>>>>>> [3] http://joshua.incubator.apache.org
>>>>>> 
>>>>>> --
>>>>>> *Lewis*
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Lewis*
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> *Lewis*
>> 
>> 
> 
> -- 
> *Lewis*



Re: Migrating Community from Github and GoogleGroups to Apache

2016-03-25 Thread Matt Post
Lewis -- regarding your requests 1--3, I've been waiting until we had versions 
of everything migrated before announcing the move (you asked me a long time ago 
to make announcements). What do we need to do before the new website is live? 
It seems worthwhile to have that running before pointing people at it.

matt


> On Mar 25, 2016, at 2:22 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> 
> wrote:
> 
> BOOM
> 
> https://github.com/apache/incubator-joshua
> 
> 
> 
> On Thu, Mar 24, 2016 at 10:55 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
> 
>> Thanks Matt, enjoy the time off (hopefully you are not ill)
>> Later
>> 
>> On Thu, Mar 24, 2016 at 10:54 AM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> (offline yesterday and today, will do tomorrow)
>>> 
>>> matt (from my phone)
>>> 
>>>> On Mar 24, 2016, at 1:48 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>>> 
>>>> Hi Matt,
>>>> As the primary figure within the Joshua community I wonder if you can
>>> act
>>>> on the following. It will go a long way in enabling us to transition
>>> things
>>>> over.
>>>> 
>>>>  1. Can you create a new branch on the Github repos at [0] called
>>>>  'apache' with only a README which states that the Joshua project has
>>> been
>>>>  transitioned to the Apache Incubator. A link to the Github repos at
>>> [1]
>>>>  (I've opened [2] to get this set up) would be great. Also a link to
>>> the new
>>>>  website (once we get it up and running) at [3].
>>>>  2. Post a message to the existing Joshua Google Groups which states
>>> that
>>>>  same as the above however references the new mailing lists at
>>>>  u...@joshua.incubator.apache.org and dev@joshua.incubator.apache.org
>>>>  3. Post to the Joshua Website that the project has been migrated over
>>> to
>>>>  the Apache Software Foundation and provide relevant links as posted
>>> above.
>>>> 
>>>> If we address the above it would be ideal.
>>>> 
>>>> Thanks Matt, please let me know if there are any clarifications
>>> required.
>>>> 
>>>> [0] https://github.com/joshua-decoder/joshua
>>>> 
>>>> [1] http://github.com/apache/incubator-joshua
>>>> 
>>>> [2] https://issues.apache.org/jira/browse/INFRA-11539
>>>> 
>>>> [3] http://joshua.incubator.apache.org
>>>> 
>>>> --
>>>> *Lewis*
>>> 
>>> 
>> 
>> 
>> --
>> *Lewis*
>> 
> 
> 
> 
> -- 
> *Lewis*



Re: [jira] [Assigned] (INFRA-11289) Load Git history for Joshua

2016-03-07 Thread Matt Post
Thanks, Tommaso. I just posted a question there about what precisely this 
means.

Also, do we have a Joshua source code import yet? Can someone tell me what the 
new model is supposed to be for development? I am unclear on exactly how the 
relationship between Apache and Github code will work in practice.

(Please point me to a document somewhere if the response is RTFM :)

Thanks,
Matt


> On Mar 7, 2016, at 11:35 AM, Tommaso Teofili  
> wrote:
> 
> Just to get started, I've created
> https://issues.apache.org/jira/browse/JOSHUA-248.
> We should see it notified on this list.
> 
> Regards,
> Tommaso
> 
> Il giorno sab 5 mar 2016 alle ore 00:57 Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> ha scritto:
> 
>> NP Henry, I am glad that we are now alive and kicking
>> 
>> On Fri, Mar 4, 2016 at 3:56 PM, Henry Saputra 
>> wrote:
>> 
>>> Thanks for the update, Lewis.
>>> 
>>> - Henry
>>> 
>>> On Fri, Mar 4, 2016 at 3:54 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>> 
 Afternoon troops,
 Ok infra are now getting to our tickets.
 I am monitoring it all  so I will see it through.
 
 
 -- Forwarded message --
 From: *Daniel Takamori (JIRA)* 
 Date: Friday, March 4, 2016
 Subject: [jira] [Assigned] (INFRA-11289) Load Git history for Joshua
 To: lewis.mcgibb...@gmail.com
 
 
 
 [ https://issues.apache.org/jira/browse/INFRA-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
 
 Daniel Takamori resolved INFRA-11289.
 -
Resolution: Fixed
 
> Load Git history for Joshua
> ---
> 
>Key: INFRA-11289
>URL:
>> https://issues.apache.org/jira/browse/INFRA-11289
>Project: Infrastructure
> Issue Type: Sub-task
> Components: Git
>   Reporter: Lewis John McGibbney
>   Assignee: Daniel Takamori
> 
> URL and of a repository or an export stream -
 https://github.com/joshua-decoder/joshua.git
> proof of IP rights -
 https://github.com/joshua-decoder/joshua/blob/master/LICENSE
> The above license and codebase relates to the Joshua proposal at
 https://wiki.apache.org/incubator/JoshuaProposal, and subsequent
>> RESULT
 thread referenced in the parent issue.
 
 
 
 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)
 
 
 
 --
 *Lewis*
 
>>> 
>> 
>> 
>> 
>> --
>> *Lewis*
>> 



Re: consolidating thread

2016-03-28 Thread Matt Post
Worked! Thank you!


> On Mar 29, 2016, at 12:10 AM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> So here’s the issue there is a lock b/c of waiting to load
> git history:
> 
> https://issues.apache.org/jira/browse/INFRA-11289
> 
> 
> I think they removed the lock so you can try.
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> 
> 
> 
> 
> -Original Message-
> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
> Reply-To: "dev@joshua.incubator.apache.org"
> <dev@joshua.incubator.apache.org>
> Date: Monday, March 28, 2016 at 8:59 PM
> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
> Subject: Re: consolidating thread
> 
>> Per ASF infra, Git is managed by having you as a member of the
>> incubator group, which I believe according to this you should
>> be able to commit. I’m talking to them in HipChat right now
>> on the infra channel trying to debug.
>> 
>> 
>> -Original Message-
>> From: Matt Post <p...@cs.jhu.edu>
>> Reply-To: "dev@joshua.incubator.apache.org"
>> <dev@joshua.incubator.apache.org>
>> Date: Monday, March 28, 2016 at 8:34 PM
>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
>> Subject: Re: consolidating thread
>> 
>>> mjpost
>>> 
>>> 
>>>> On Mar 28, 2016, at 11:33 PM, Mattmann, Chris A (3980)
>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>> 
>>>> Matt, what’s your Apache username?
>>>> 
>>>> -Original Message-
>>>> From: Matt Post <p...@cs.jhu.edu>
>>>> Reply-To: "dev@joshua.incubator.apache.org"
>>>> <dev@joshua.incubator.apache.org>
>>>> Date: Monday, March 28, 2016 at 8:31 PM
>>>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
>>>> Subject: Re: consolidating thread
>>>> 
>>>>>>> 
>>>>>>> (4)
>>>>>>> 
>>>>>>> 
>>>>>>> https://git-wip-us.apache.org/repos/asf?p=incubator-joshua.git;a=summ
>>>>>>> a

Re: Problem with git repo

2016-03-28 Thread Matt Post
Hi Daniel,

No worries, it didn't take too long to track down (I only started trying to 
push last night). Thanks for your help!

- matt

> On Mar 29, 2016, at 12:28 AM, Daniel Takamori  wrote:
> 
> Hey Joshua Team,
> Sorry about the recent confusion with the git repo; I was in charge of
> setting it up and I did not set the correct ticket status (WaitForUser) and
> instead closed the ticket when I needed someone to sign off that the repo
> had been created.  I apologize for the mistake and hope that it didn't
> cause too many problems in your work.
> 
> -Pono from Infrastructure



Re: Using Jira for Issues

2016-04-29 Thread Matt Post
Lewis, this sounds good to me.

I'm in the process of moving the (hideous) Joshua web page over to Confluence, 
and created a Developer page, where I added this to the documentation.

https://cwiki.apache.org/confluence/display/JOSHUA/Development

Can you look this over and improve it (e.g., with links on the appropriate 
instruction points?)

matt



> On Apr 29, 2016, at 7:13 AM, Lewis John Mcgibbney  
> wrote:
> 
> Hi Folks,
> One of the things about our Jira instance, is that it is hosted by and at
> the ASF. Therefore all correspondence is always available to the ASF.
> If Github were ever to vanish, we would essentially lose all of the
> correspondence for all of the tickets and issues created over there.
> Typically what I, and every other Apache project I am aware of does, is to
> first open a ticket in Jira, then just title your pull request commit
> message after the Jira ticket.
> This way we also have comprehensive release reports, assignees, road maps,
> etc etc etc.
> I would like to suggest that we start using Jira in this manner as recently
> I've not really seen any tickets go in there.
> What do you think about this?
> Lewis
> 
> 
> -- 
> *Lewis*



Re: [jira] [Commented] (JOSHUA-253) Enable execution of Unit tests

2016-04-27 Thread Matt Post
I am fine with you just doing this. The current setup was a 
something-is-better-than-nothing (which is true) hack, and I'd be happy to have 
better practices pushed into the project. 

matt (from my phone)

> On Apr 27, 2016, at 2:39 PM, Kellen Sunderland (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260709#comment-15260709
>  ] 
> 
> Kellen Sunderland commented on JOSHUA-253:
> --
> 
> We've got a few unit tests we've created for Joshua, and we'd like to 
> eventually hook them into the Joshua build process.  
> 
> This is one topic I'd like to discuss at ApacheCon.  What I would like to 
> propose is to convert the current regression tests to be run by a unit test 
> runner (at the same time as the actual unit tests are run).  The main 
> advantage of having the regression tests runnable from a unit test runner is 
> that we'll be able to debug when there's a failure (this is quite tricky at 
> the moment).
> 
>> Enable execution of Unit tests
>> --
>> 
>>Key: JOSHUA-253
>>URL: https://issues.apache.org/jira/browse/JOSHUA-253
>>Project: Joshua
>> Issue Type: Test
>>   Affects Versions: 6.0
>>   Reporter: Lewis John McGibbney
>>Fix For: 6.1
>> 
>> 
>> As per our [discussion on this 
>> topic|http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg00270.html],
>>  [~teofili] correctly identified that unit level tests are not executed.
>> We need to fix this such that they are.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)



Re: joshua_api

2016-04-27 Thread Matt Post
Do you want me to fix the recapitalization? Or are you going to do that? I 
looked a bit, and it seems I'll have to add a method to get a word alignment 
object instead of just the string, so that I can poke through them. This 
approach is as good as true-casing in some languages.
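
To make the alignment idea concrete, here is a rough Python sketch (not the 
Joshua code, which is Java). It assumes the alignment string is in the common 
"src-tgt" index format (e.g. "0-0 1-2"), which may not match what the decoder 
actually emits; the function names are made up for illustration.

```python
def parse_alignment(alignment):
    """Parse 'src-tgt' index pairs, e.g. '0-0 1-2' -> [(0, 0), (1, 2)].

    Assumption: whitespace-separated pairs joined by a hyphen.
    """
    pairs = []
    for item in alignment.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_case(source_tokens, target_tokens, alignment):
    """Capitalize every target token aligned to a capitalized source token."""
    result = list(target_tokens)
    for src, tgt in parse_alignment(alignment):
        if source_tokens[src][:1].isupper():
            # Upper-case only the first character; leave the rest untouched.
            result[tgt] = result[tgt][:1].upper() + result[tgt][1:]
    return result
```

Working from index pairs rather than the raw string is the point: the 
projection is a lookup per aligned pair. Unaligned target words are left 
lowercase in this sketch.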

A few other things:

- I saw a comment in the commit about the changes not working for phrase-based 
translation. Can you (or Felix) elaborate? What exactly will no longer work?

- Currently, there are multiple places where the "output-format" string has to 
get edited (KBestExtractor and in Translation). After you push your changes in, 
I'm going to make some edits so that this all occurs in one place.

matt


> On Apr 27, 2016, at 2:25 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Thanks for taking a look Matt,
> 
> I think this is all we've got planned as far as changes relating to an API
> would go.  We have a few more commits coming but they're just performance
> improvements and they don't change too much in the way of interfaces or
> method signatures.
> 
> -Kellen
> 
> On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Kellen,
>> 
>> Great. I had a chance to start looking over the ReworkedExtractions
>> branch. I'll have some more time today. It looks good to me so far. Is
>> there anything else you plan to do, or does that branch contain basically
>> all of it (apart from the recapitalization fix, which I see should be
>> applied more selectively, maybe only when a -recapitalize flag is present,
>> to save on time).
>> 
>> matt
>> 
>> 
>>> On Apr 26, 2016, at 1:56 AM, kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>> Hey Matt,
>>> 
>>> I've opened a new pull request with a few of our commits, feel free to
>> take
>>> a look when you have some time.
>>> 
>>> More importantly I've pushed our queue of upcoming commits to the
>> following
>>> branch in my fork:
>>> 
>> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
>>> .  From there you can get an idea for the work we've done so far.  I
>>> haven't opened a PR yet for these commits because there's still some
>>> merging I have to do (there's a few failing tests and I had to
>> temporarily
>>> comment out some of your casing code).  Once that's fixed I'll do a
>> proper
>>> PR for these commits.
>>> 
>>> -Kellen
>>> 
>>> On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Great. On that first point, I meant that translate() would return a
>>>> Translation object, which would know its hypergraph and could iterate
>> over
>>>> a KBestExtractor. In any case, though, it sounds like you are a bit
>> ahead
>>>> of me on this, so I'll wait for a push that I can see, and then we can
>>>> converge on the design.
>>>> 
>>>> matt
>>>> 
>>>> 
>>>>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de> wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> These are some nice suggestions. Most of the work we have done is in
>>>> line with what you propose so I would agree with Kellen that we should
>>>> synchronize and compare sooner rather than later.
>>>>> 
>>>>> Best,
>>>>> Felix
>>>>> 
>>>>>> On 25.04.2016, at 07:44, kellen sunderland <
>> kellen.sunderl...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> Hey Matt,
>>>>>> 
>>>>>> Sorry for the late reply.  The Joshua-6 folder and tst may have just
>>>> been
>>>>>> artifacts of some symlinks I have locally.  Sorry they may have been
>>>> pushed
>>>>>> by mistake, I can clean that up.
>>>>>> 
>>>>>> Good idea to have the api code in a separate branch.  We can merge the
>>>> work
>>>>>> that we've done some time next week.
>>>>>> 
>>>>>> KBestExtractor is one of the things we want to return via the API.  We
>>>>>> already have some of this implemented though as you suggest.  I'll try
>>>> and
>>>>>> push the remaining work we've done into my github branch so you can
>>>> compare.
>>>>>> 
>>>>>> -Kellen

Re: joshua_api

2016-04-27 Thread Matt Post
Sure thing, hope to tonight.

matt


> On Apr 27, 2016, at 6:41 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Hey Matt,
> 
> If you had time that would be fantastic.  I've created a new PR in case you
> want to pull it in.  There's actually 4 tests failing for me currently
> (casing issues causing at least one).  If you want to wait until we fix
> these tests that's also completely fine.
> 
> -Kellen
> 
> On Wed, Apr 27, 2016 at 11:32 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Do you want me to fix the recapitalization? Or are you going to do that? I
>> looked a bit, and it seems I'll have to add a method to get a word
>> alignment object instead of just the string, so that I can poke through
>> them. This approach is as good as true-casing in some languages.
>> 
>> A few other things:
>> 
>> - I saw a comment in the commit about the changes not working for
>> phrase-based translation. Can you (or Felix) elaborate? What exactly will
>> no longer work?
>> 
>> - Currently, there are multiple places where the "output-format" string
>> has to get edited (KBestExtractor and in Translation). After you push your
>> changes in, I'm going to make some edits so that this all occurs in one
>> place.
>> 
>> matt
>> 
>> 
>>> On Apr 27, 2016, at 2:25 PM, kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>> Thanks for taking a look Matt,
>>> 
>>> I think this is all we've got planned as far as changes relating to an
>> API
>>> would go.  We have a few more commits coming but they're just performance
>>> improvements and they don't change too much in the way of interfaces or
>>> method signatures.
>>> 
>>> -Kellen
>>> 
>>> On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Kellen,
>>>> 
>>>> Great. I had a chance to start looking over the ReworkedExtractions
>>>> branch. I'll have some more time today. It looks good to me so far. Is
>>>> there anything else you plan to do, or does that branch contain
>> basically
>>>> all of it (apart from the recapitalization fix, which I see should be
>>>> applied more selectively, maybe only when a -recapitalize flag is
>> present,
>>>> to save on time).
>>>> 
>>>> matt
>>>> 
>>>> 
>>>>> On Apr 26, 2016, at 1:56 AM, kellen sunderland <
>>>> kellen.sunderl...@gmail.com> wrote:
>>>>> 
>>>>> Hey Matt,
>>>>> 
>>>>> I've opened a new pull request with a few of our commits, feel free to
>>>> take
>>>>> a look when you have some time.
>>>>> 
>>>>> More importantly I've pushed our queue of upcoming commits to the
>>>> following
>>>>> branch in my fork:
>>>>> 
>>>> 
>> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
>>>>> .  From there you can get an idea for the work we've done so far.  I
>>>>> haven't opened a PR yet for these commits because there's still some
>>>>> merging I have to do (there's a few failing tests and I had to
>>>> temporarily
>>>>> comment out some of your casing code).  Once that's fixed I'll do a
>>>> proper
>>>>> PR for these commits.
>>>>> 
>>>>> -Kellen
>>>>> 
>>>>> On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>>> 
>>>>>> Great. On that first point, I meant that translate() would return a
>>>>>> Translation object, which would know its hypergraph and could iterate
>>>> over
>>>>>> a KBestExtractor. In any case, though, it sounds like you are a bit
>>>> ahead
>>>>>> of me on this, so I'll wait for a push that I can see, and then we can
>>>>>> converge on the design.
>>>>>> 
>>>>>> matt
>>>>>> 
>>>>>> 
>>>>>>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de>
>> wrote:
>>>>>>> 
>>>>>>> Hi Matt,
>>>>>>> 
>>>>>>> These are some nice suggestions. Most of the work we have done is in
>>>>>> line with what you propose so I would agree with Kellen that we should
>>>>

Re: Language Pack size

2016-05-13 Thread Matt Post
Oh, yes, of course. That's in build_binary.


> On May 13, 2016, at 4:39 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Could we also use quantization with the language model to reduce the size?
> KenLM supports this right?
> 
> On Fri, May 13, 2016 at 1:19 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Great idea, hadn't thought of that.
>> 
>> I think we could also get some leverage out of:
>> 
>> - Reducing the language model to a 4-gram one
>> - Doing some filtering of the phrase table to reduce low-probability
>> translation options
>> 
>> These would be a bit lossier but I doubt it would matter much at all.
>> 
>> matt
>> 
>> 
>>> On May 13, 2016, at 4:02 PM, Tom Barber <t...@analytical-labs.com> wrote:
>>> 
>>> Out of curiosity more than anything else I tested XZ compression on a
>> model
>>> instead of Gzip, it takes the Spain pack down from 1.9GB to 1.5GB, not
>> the
>>> most ever, but obviously does mean 400MB+ less in remote storage and data
>>> going over the wire.
>>> 
>>> Worth considering I guess.
>>> 
>>> Tom
>>> --
>>> 
>>> Director Meteorite.bi - Saiku Analytics Founder
>>> Tel: +44(0)5603641316
>>> 
>>> (Thanks to the Saiku community we reached our Kickstart
>>> <
>> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
>>> 
>>> goal, but you can always help by sponsoring the project
>>> <http://www.meteorite.bi/products/saiku/sponsorship>)
>> 
>> 



Re: Language Pack size

2016-05-13 Thread Matt Post
Quantization is also supported in the grammar packer.

Another idea: since we know the model weights when we publish a language pack, 
we should pre-compute the dot product of the weight vector against the grammar 
weights and reduce it to a single (quantized) score.

(This would reduce the ability for users to play with the individual weights, 
but I don't think that's a huge loss, since the main weight is LM vs. TM).
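
A quick sketch of what I mean, in Python for brevity. The feature names, the 
uniform 8-bit bucketing, and the function names are all assumptions for 
illustration; the quantizers in the grammar packer may work differently.

```python
def collapse_features(feature_values, weights):
    """Dot product of one rule's feature values with the tuned weights.

    Features with no tuned weight contribute nothing.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in feature_values.items())


def quantize(score, lo, hi, levels=256):
    """Map a score onto one of `levels` evenly spaced buckets over [lo, hi]."""
    clipped = min(max(score, lo), hi)
    step = (hi - lo) / (levels - 1)
    return round((clipped - lo) / step)


def dequantize(bucket, lo, hi, levels=256):
    """Recover the approximate score that a bucket index stands for."""
    step = (hi - lo) / (levels - 1)
    return lo + bucket * step
```

Each rule would then store one small integer instead of a vector of floats, 
and the decoder would apply a single TM weight to the dequantized score.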

matt


> On May 13, 2016, at 4:45 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Oh, yes, of course. That's in build_binary.
> 
> 
>> On May 13, 2016, at 4:39 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
>> wrote:
>> 
>> Could we also use quantization with the language model to reduce the size?
>> KenLM supports this right?
>> 
>> On Fri, May 13, 2016 at 1:19 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Great idea, hadn't thought of that.
>>> 
>>> I think we could also get some leverage out of:
>>> 
>>> - Reducing the language model to a 4-gram one
>>> - Doing some filtering of the phrase table to reduce low-probability
>>> translation options
>>> 
>>> These would be a bit lossier but I doubt it would matter much at all.
>>> 
>>> matt
>>> 
>>> 
>>>> On May 13, 2016, at 4:02 PM, Tom Barber <t...@analytical-labs.com> wrote:
>>>> 
>>>> Out of curiosity more than anything else I tested XZ compression on a
>>> model
>>>> instead of Gzip, it takes the Spain pack down from 1.9GB to 1.5GB, not
>>> the
>>>> most ever, but obviously does mean 400MB+ less in remote storage and data
>>>> going over the wire.
>>>> 
>>>> Worth considering I guess.
>>>> 
>>>> Tom
>>>> --
>>>> 
>>>> Director Meteorite.bi - Saiku Analytics Founder
>>>> Tel: +44(0)5603641316
>>>> 
>>>> (Thanks to the Saiku community we reached our Kickstart
>>>> <
>>> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
>>>> 
>>>> goal, but you can always help by sponsoring the project
>>>> <http://www.meteorite.bi/products/saiku/sponsorship>)
>>> 
>>> 
> 



Re: GIZA++ Licensing

2016-05-06 Thread Matt Post
I included a bunch of tools like GIZA a while back in order to make it easier 
for people to build systems. I think that's the wrong approach now, since we're 
focusing on providing black-box systems. So we should remove tools that aren't 
run-time dependencies, like GIZA, and just ask people to install them 
separately.



> On May 5, 2016, at 11:22 PM, Lewis John Mcgibbney  
> wrote:
> 
> Hi Folks,
> I am looking at the GIZA++ dependency currently packaged with the Joshua
> source [0]. This is GPL licensed [1]. We cannot package this under anything
> we distribute as an Apache release.
> Lewis
> 
> [0] https://github.com/apache/incubator-joshua/tree/master/ext/giza-pp
> [1]
> http://www.fjoch.com/giza-training-of-statistical-translation-models.html
> 
> -- 
> *Lewis*



Re: Podling Report Reminder - May 2016

2016-05-01 Thread Matt Post
Thanks, Lewis. I'll take a look at this by Weds.


> On May 1, 2016, at 4:16 PM, Lewis John Mcgibbney  
> wrote:
> 
> Hi Folks,
> Initial report populated as below
> 
> Joshua
> 
> Joshua is a statistical machine translation toolkit
> 
> Joshua has been incubating since 2016-02-13.
> 
> Three most important issues to address in the move towards graduation:
> 
>   1. Ensure first release of Joshua Incubating artifacts (6.1)
>   2. Continue to build the Joshua PPMC and user community
>   3. Investigate targeted user communities within Apache
> 
> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
> aware of?
> 
> None
> 
> How has the community developed since the last report?
> 
> Joshua is being represented at ApacheCon. Currently one presentation is
> taking place in addition to a meetup.
> 
> How has the project developed since the last report?
> 
> The project team have been working on development code as well as
> official branding. All infrastructure has been migrated over to Apache.
> We are aiming for our first release some time after ApacheCon.
> 
> Date of last release:
> 
>   N/A
> 
> When were the last committers or PMC members elected?
> 
> Kellen Sunderland and Felix Hieber joined the PPMC and as committers on
> April 11th, 2016.
> 
> Signed-off-by:
> 
>   [ ](joshua) Paul Ramirez
>   [X](joshua) Lewis John McGibbney
>   [ ](joshua) Chris Mattmann
>   [ ](joshua) Tom Barber
>   [ ](joshua) Henri Yandell
> 
> 
> 
> On Sun, May 1, 2016 at 6:10 AM,  wrote:
> 
>> Dear podling,
>> 
>> This email was sent by an automated system on behalf of the Apache
>> Incubator PMC. It is an initial reminder to give you plenty of time to
>> prepare your quarterly board report.
>> 
>> The board meeting is scheduled for Wed, 18 May 2016, 10:30 am PDT.
>> The report for your podling will form a part of the Incubator PMC
>> report. The Incubator PMC requires your report to be submitted 2 weeks
>> before the board meeting, to allow sufficient time for review and
>> submission (Wed, May 4th).
>> 
>> Please submit your report with sufficient time to allow the Incubator
>> PMC, and subsequently board members to review and digest. Again, the
>> very latest you should submit your report is 2 weeks prior to the board
>> meeting.
>> 
>> Thanks,
>> 
>> The Apache Incubator PMC
>> 
>> Submitting your Report
>> 
>> --
>> 
>> Your report should contain the following:
>> 
>> *   Your project name
>> *   A brief description of your project, which assumes no knowledge of
>>the project or necessarily of its field
>> *   A list of the three most important issues to address in the move
>>towards graduation.
>> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>>aware of
>> *   How has the community developed since the last report
>> *   How has the project developed since the last report.
>> 
>> This should be appended to the Incubator Wiki page at:
>> 
>> http://wiki.apache.org/incubator/May2016
>> 
>> Note: This is manually populated. You may need to wait a little before
>> this page is created from a template.
>> 
>> Mentors
>> ---
>> 
>> Mentors should review reports for their project(s) and sign them off on
>> the Incubator wiki page. Signing off reports shows that you are
>> following the project - projects that are not signed may raise alarms
>> for the Incubator PMC.
>> 
>> Incubator PMC
>> 
> 
> 
> 
> -- 
> *Lewis*



Fwd: Blogging opportunities on Linux.com and OpenSource.com

2016-04-18 Thread Matt Post
This would be a great thing to do once we push out the first release. I'd be 
happy to participate if anyone wants to team up.


> Begin forwarded message:
> 
> From: Sally Khudairi 
> Subject: Blogging opportunities on Linux.com and OpenSource.com
> Date: April 17, 2016 at 8:56:55 PM EDT
> To: Committers 
> Reply-To: Sally Khudairi 
> 
> Hello Apache Committers! 
> 
> As the Apache community is widely recognized for its diversity, innovation, 
> and expertise, we have an open invitation for Apache Committers to blog for 
> Linux.com and OpenSource.com. 
> 
> Topics can range from specific Apache projects (top-level features/benefits/* 
> --not at the developer/technical/coding level) to ApacheCon (why you're 
> attending/what you're presenting) to The Apache Way (process/methodologies/*) 
> to general Open Source (theory/innovation/opportunities/challenges), but 
> _NOT_ commercial promotion. 
> 
> OpenSource.com have an editorial calendar [1] that might be helpful in 
> sparking ideas for either outlet. 
> 
> Please let me know if you're interested (specify which blog --or both!) and 
> I'll be happy to connect you. 
> 
> Warm thanks, 
> Sally 
> 
> [1] https://opensource.com/resources/editorial-calendar 
> 
> = = = = = 
> vox +1 617 921 8656 
> gvox +1 646 598 4616 
> skype sallykhudairi



Re: http://joshua.incubator.apache.org/

2016-04-14 Thread Matt Post
Hi Igor,

Yes, here's the ticket in case that is helpful to you:


https://issues.apache.org/jira/browse/INFRA-11295?jql=project%20%3D%20INFRA%20AND%20text%20~%20%22Joshua%22

matt


> On Apr 13, 2016, at 7:20 PM, Tom Barber  wrote:
> 
> Hi Igor
> 
> I believe in our case we got infra to set up a gitpubsub config for us.
> 
> Tom
> On 14 Apr 2016 00:06, "Igor Katkov"  wrote:
> 
>> Hi fellow Apache incubation project devs,
>> 
>> My name is Igor Katkov, dev@Omid
>> http://incubator.apache.org/projects/omid.html
>> 
>> I'd really appreciate it if someone who put up
>> http://joshua.incubator.apache.org, or who knows how to do it, could advise me.
>> 
>> *Question*: how do I set up the website? I can generate static HTML with no
>> problem, but where/how do I publish it?
>> 



Re: tests are not run using latest code

2016-04-14 Thread Matt Post
"ant test" runs the regression tests under test/ (it runs 
test/run-all-tests.sh, which finds every script named test.sh underneath test/ 
and executes it).

No unit tests are currently run. This is obviously broken.
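
For anyone who hasn't looked at it, the discovery loop amounts to roughly the 
sketch below. This is an illustration under assumptions, not the contents of 
test/run-all-tests.sh: the function name run_all_tests and the pass/fail 
summary line are made up for the example.

```shell
# Sketch only: find every script named test.sh under a root directory,
# run each one from its own directory, and summarise the results.
# run_all_tests and its output format are hypothetical, not Joshua's script.
run_all_tests() {
  root="${1:-test}"   # directory to search; defaults to test/
  pass=0
  fail=0
  while IFS= read -r script; do
    [ -n "$script" ] || continue   # skip the empty line when nothing is found
    # Run in a subshell so the cd does not leak; detach stdin from the list.
    if (cd "$(dirname "$script")" && bash ./test.sh </dev/null >/dev/null 2>&1); then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
      echo "FAILED: $script"
    fi
  done <<EOF
$(find "$root" -type f -name test.sh | sort)
EOF
  echo "passed=$pass failed=$fail"
  [ "$fail" -eq 0 ]   # non-zero exit status if any script failed
}
```

Calling run_all_tests with no argument searches test/ relative to the current 
directory. Note that nothing here touches the Java unit tests, which is the gap 
described above.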

matt


> On Apr 12, 2016, at 8:02 PM, Lewis John Mcgibbney  
> wrote:
> 
> I thought it was the unit tests that were being run!
> If not then we need to log a Jira ticket and make sure that they are.
> Lewis
> 
> On Tue, Apr 12, 2016 at 7:31 AM, Tommaso Teofili 
> wrote:
> 
>> Il giorno mar 12 apr 2016 alle ore 15:10 Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> ha scritto:
>> 
>>> So you are building with Maven, Tommaso? The Maven build is not
>>> functional.
>>> 
>> 
>> I have committed a couple of fixes so that by now running mvn clean install
>> works (but no tests are run).
>> 
>> 
>>> Instead if you build with Ant it does build the tests and then run them.
>>> 
>> 
>> the thing is that the Ant tests execute the bash scripts which are
>> integration tests in the end, while I was expecting the java unit tests to
>> be executed (too); I think those should work with Ant too.
>> 
>> Regards,
>> Tommaso
>> 
>> 
>>> Lewis
>>> 
>>> On Tuesday, April 12, 2016, Tommaso Teofili 
>>> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> while having a look at [1], I have realized that current tests in Joshua
>>>> do not run against latest source code (I think), in fact the test
>>>> compilation fails (with Maven) if I just set the test directory.
>>>> I think that, besides the Ant vs Maven thing, it'd be really good if we
>>>> could use latest code when performing tests.
>>>> Or am I missing something ?
>>>> 
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> [1] : https://issues.apache.org/jira/browse/JOSHUA-252
>>>> 
>>> 
>>> 
>>> --
>>> *Lewis*
>>> 
>> 
> 
> 
> 
> -- 
> *Lewis*



Re: Avoiding master failures with CI

2016-07-13 Thread Matt Post
I misread the day, here, and thought you meant today. I can't do tomorrow 
afternoon, but that time on Friday works for me. We could also go into next 
week if that's better.


> On Jul 13, 2016, at 9:41 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> That works for me. I've watched the video you linked so I have a feel for 
> this, but I still think it'd be good to chat.
> 
> matt
> 
> 
>> On Jul 13, 2016, at 8:21 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
>> wrote:
>> 
>> Would everyone be ok with tomorrow at 5PM UTC?
>> 
>> -Kellen
>> 
>> On Tue, Jul 12, 2016 at 8:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Hi Kellen,
>>> 
>>> No worries, and you did provide a link. I think a Google Hangouts
>>> walkthrough would be an efficient way to go about this. What day and time
>>> work for you? I am mostly open this week.
>>> 
>>> matt
>>> 
>>> 
>>>> On Jul 11, 2016, at 6:50 PM, kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>>> 
>>>> Sorry, I should have provided the link to this page: https://travis-ci.org/ .
>>>> If you scroll down a bit on that page there's a Pull Request flow section,
>>>> it's the flow I'd be most in favour of.  There's also a decent (but rushed)
>>>> demo here: https://www.youtube.com/watch?v=Uft5KBimzyk .  We actually don't
>>>> need to do a lot of the work that he demos, i.e. no node or gulp
>>>> configuration.  Our setup is close enough to a default Java project
>>>> that we just have to tell it to build Java 8 and then it runs Maven
>>>> properly.
>>>> 
>>>> Using a CI server would have some aspects that are similar to the branching
>>>> document you mention, and some benefits that are a bit orthogonal.  Most of
>>>> these benefits have to do with unit testing, which isn't covered in the doc.
>>>> 
>>>> First the orthogonal benefits:  The main benefit we would get from using CI
>>>> is that we guarantee code in our repo is never broken.  That is to say
>>>> tests always pass and it always builds correctly.  CI servers are really
>>>> useful to prevent problems where one developer may have everything working
>>>> properly on his/her machine, but later realizes it's not working
>>>> on another dev's machine.  A good example of this is the class-based-lm-test
>>>> we pushed recently.  It works fine for me locally but it would fail for
>>>> anyone without kenlm.so.  There are many other examples (javadoc errors,
>>>> code style, etc) but what will happen in these cases is we'll see a big
>>>> obvious 'The build has problems' message in the PR page on Github.  If the
>>>> CI server runs all of our code quality checks and finds that everything
>>>> is good we'll get a big 'This PR is ready to merge' message.
>>>> 
>>>> Now to the part that overlaps a bit with branching.  There are various
>>>> branching strategies that we could adopt for the project.  The master / dev
>>>> branch one is a possibility.  I'd suggest we try to commit code strictly in
>>>> PRs rather than pushing to git.  This would be the equivalent of feature
>>>> branching from your link.  The reason I'd suggest that approach is that
>>>> from what I've seen it'll be dead simple to get working with Github and
>>>> Travis, and it gives us the same goal of having a stable master branch.
>>>> 
>>>> If you'd like we can walk through setting this up together on a forked
>>>> version of our Github repo.  We could do a quick example of how code would
>>>> be pushed and merged.  I should be available for a google hangout some time
>>>> this week if that works for you?
>>>> 
>>>> -Kellen
>>>> 
>>>> 
>>>> On Mon, Jul 11, 2016 at 10:51 PM, Mattmann, Chris A (3980) <
>>>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>>> 
>>>>> CI = continuous integration :)
>>>>> 
>>>>> ++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>

master pushes

2016-07-28 Thread Matt Post
Hi folks,

Sorry for the continued pushes to master. We have had Travis-CI enabled, but I 
haven't taken the time to get it set up. Someone else should feel free to take 
charge here; otherwise, I hope to have time to do this after my workshop is 
done, at the end of next week.
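
For whoever picks this up, the configuration is probably about as small as the 
sketch below, which writes a minimal .travis.yml. It is a guess under 
assumptions, not a tested setup: the helper name write_travis_config is made up, 
and the oraclejdk8 image and the "mvn test" script line are my assumptions based 
on the earlier discussion of building with Java 8 and Maven.

```shell
# Sketch only: write a minimal, hypothetical .travis.yml into a directory.
# The jdk value and the script line are assumptions, not a verified config.
write_travis_config() {
  dir="${1:-.}"   # target directory; defaults to the current one
  cat > "$dir/.travis.yml" <<'EOF'
language: java
jdk:
  - oraclejdk8
script:
  - mvn test
EOF
}
```

Running write_travis_config from the repository root would drop the file next 
to pom.xml, after which it still needs to be reviewed and committed.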

matt

Re: Podling Report Reminder - August 2016

2016-08-01 Thread Matt Post
Hi folks,

I just loaded this with a draft. Comments / unilateral changes will not meet 
resistance from me.
--
Joshua

Joshua is a statistical machine translation toolkit

Joshua has been incubating since 2016-02-13.

Three most important issues to address in the move towards graduation:

  1. Ensure first release of Joshua Incubating artifacts (6.1)
  2. Continue to build the Joshua PPMC and user community
  3. Investigate targeted user communities within Apache

Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
aware of?

  None.

How has the community developed since the last report?

  We have gained a new contributor, and have continued developing and
  updating the web page to increase interest. We have not made
  any real advertising or publicity pushes, but hope to around the
  time of our first formal release under the Apache banner (targeted
  for September).

How has the project developed since the last report?

  We have been steadily pushing up stability and design improvements,
  and are also deep in discussion about further changes. We have
  made some changes to our team infrastructure, including enabling
  Travis-CI for continuous integration testing.

Date of last release:

  N/A

When were the last committers or PMC members elected?

  April 11, 2016 (Kellen Sunderland and Felix Hieber)

Signed-off-by:

  [ ](joshua) Paul Ramirez
  [ ](joshua) Lewis John McGibbney
  [ ](joshua) Chris Mattmann
  [ ](joshua) Tom Barber
  [ ](joshua) Henri Yandell

Shepherd/Mentor notes:
--
matt


> On Jul 31, 2016, at 9:15 AM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 17 August 2016, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, August 03).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>     the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>     towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>     aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> http://wiki.apache.org/incubator/August2016
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC



Re: [GitHub] incubator-joshua issue #33: Refactored unit tests to all use TestNG, removed...

2016-07-31 Thread Matt Post
Hi Kellen,

The current standard location for KenLM is $JOSHUA/lib. I'm happy to move this 
if there is a more conventional spot. $JOSHUA/target?
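
In the meantime, a test runner or wrapper script could check for the library 
with something like the sketch below. The function name find_kenlm is made up, 
and the file name libkenlm.so is taken from your message; on other platforms the 
extension may well differ, so treat this as an assumption.

```shell
# Sketch only: report where libkenlm.so is expected under a Joshua root.
# find_kenlm is a hypothetical helper, not part of the Joshua scripts.
find_kenlm() {
  root="${1:-$JOSHUA}"            # Joshua root; defaults to $JOSHUA
  lib="$root/lib/libkenlm.so"     # current standard location is $JOSHUA/lib
  if [ -f "$lib" ]; then
    echo "$lib"
  else
    echo "libkenlm.so not found under $root/lib" >&2
    return 1
  fi
}
```

If the location moves, only the one path inside the function would need to 
change.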

matt



> On Jul 29, 2016, at 3:47 AM, KellenSunderland  wrote:
> 
> Github user KellenSunderland commented on the issue:
> 
>     https://github.com/apache/incubator-joshua/pull/33
> 
>     Hey Matt, thanks for letting me know.  I missed adding a resource
>     directory (I always forget to git add dirs for some reason).  Things are
>     working for me locally now.
> 
>     I had one small question: where should we point the test runner to look
>     for libkenlm.so?  At the moment I chose the rootdir/lib as it seems to be
>     auto-created when the ./download-deps.sh script is run, but maybe there's
>     a better location?
> 
>     Here's what I've run to verify the build locally:
> 
>     #no KenLM
>     export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/
>     export JOSHUA=`pwd`
>     git clean -xdf
>     mvn test
>     Tests run: 120, Failures: 0, Errors: 0, Skipped: 10
>     [INFO] BUILD SUCCESS
> 
>     #no KenLM/ext
>     git clean -xdf
>     rm -rf ext
>     mvn test
>     Tests run: 122, Failures: 0, Errors: 0, Skipped: 12, Time elapsed: 1.961 sec - in TestSuite
>     [INFO] BUILD SUCCESS
> 
>     #Fresh KenLM
>     git clean -xdf
>     rm -rf ext
>     ./download-deps.sh
>     mvn test
>     Tests run: 116, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.689 sec - in TestSuite
>     [INFO] BUILD SUCCESS
> 
>     -Kellen
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---



Re: Podling Report Reminder - August 2016

2016-08-02 Thread Matt Post
Okay, I added this.

If things go as planned we'll have a nice release to mention in the next report.

matt


> On Aug 1, 2016, at 2:56 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> 
> I think it would be great to mention it.
> 
> Releases are a key growth indicator for a podling.
> 
> - Henry
> 
> On Mon, Aug 1, 2016 at 9:08 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> 
>> Looks good to me.  Should we mention that we're planning a release that
>> moves our build system to maven?
>> 
>> On Mon, Aug 1, 2016 at 3:10 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Hi folks,
>>> 
>>> I just loaded this with a draft. Comments / unilateral changes will not
>>> meet resistance from me.
>>> --
>>> Joshua
>>> 
>>> Joshua is a statistical machine translation toolkit
>>> 
>>> Joshua has been incubating since 2016-02-13.
>>> 
>>> Three most important issues to address in the move towards graduation:
>>> 
>>>  1. Ensure first release of Joshua Incubating artifacts (6.1)
>>>  2. Continue to build the Joshua PPMC and user community
>>>  3. Investigate targeted user communities within Apache
>>> 
>>> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
>>> aware of?
>>> 
>>>  None.
>>> 
>>> How has the community developed since the last report?
>>> 
>>>  We have gained a new contributor, and have continued developing and
>>>  updating the web page to increase interest. We have not made
>>>  any real advertising or publicity pushes, but hope to around the
>>>  time of our first formal release under the Apache banner (targeted
>>>  for September).
>>> 
>>> How has the project developed since the last report?
>>> 
>>>  We have been steadily pushing up stability and design improvements,
>>>  and are also deep in discussion about further changes. We have
>>>  made some changes to our team infrastructure, including enabling
>>>  Travis-CI for continuous integration testing.
>>> 
>>> Date of last release:
>>> 
>>>  N/A
>>> 
>>> When were the last committers or PMC members elected?
>>> 
>>>  April 11, 2016 (Kellen Sunderland and Felix Hieber)
>>> 
>>> Signed-off-by:
>>> 
>>>  [ ](joshua) Paul Ramirez
>>>  [ ](joshua) Lewis John McGibbney
>>>  [ ](joshua) Chris Mattmann
>>>  [ ](joshua) Tom Barber
>>>  [ ](joshua) Henri Yandell
>>> 
>>> Shepherd/Mentor notes:
>>> --
>>> matt
>>> 
>>> 
>>>> On Jul 31, 2016, at 9:15 AM, johndam...@apache.org wrote:
>>>> 
>>>> Dear podling,
>>>> 
>>>> This email was sent by an automated system on behalf of the Apache
>>>> Incubator PMC. It is an initial reminder to give you plenty of time to
>>>> prepare your quarterly board report.
>>>> 
>>>> The board meeting is scheduled for Wed, 17 August 2016, 10:30 am PDT.
>>>> The report for your podling will form a part of the Incubator PMC
>>>> report. The Incubator PMC requires your report to be submitted 2 weeks
>>>> before the board meeting, to allow sufficient time for review and
>>>> submission (Wed, August 03).
>>>> 
>>>> Please submit your report with sufficient time to allow the Incubator
>>>> PMC, and subsequently board members to review and digest. Again, the
>>>> very latest you should submit your report is 2 weeks prior to the board
>>>> meeting.
>>>> 
>>>> Thanks,
>>>> 
>>>> The Apache Incubator PMC
>>>> 
>>>> Submitting your Report
>>>> 
>>>> --
>>>> 
>>>> Your report should contain the following:
>>>> 
>>>> *   Your project name
>>>> *   A brief description of your project, which assumes no knowledge of
>>>>     the project or necessarily of its field
>>>> *   A list of the three most important issues to address in the move
>>>>     towards graduation.
>>>> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>>>>     aware of
>>>> *   How has the community developed since the last report
>>>> *   How has the project developed since the last report.
>>>> 
>>>> This should be appended to the Incubator Wiki page at:
>>>> 
>>>> http://wiki.apache.org/incubator/August2016
>>>> 
>>>> Note: This is manually populated. You may need to wait a little before
>>>> this page is created from a template.
>>>> 
>>>> Mentors
>>>> ---
>>>> 
>>>> Mentors should review reports for their project(s) and sign them off on
>>>> the Incubator wiki page. Signing off reports shows that you are
>>>> following the project - projects that are not signed may raise alarms
>>>> for the Incubator PMC.
>>>> 
>>>> Incubator PMC
>>> 
>>> 
>> 



Re: Language Pack English-Japanese

2016-08-04 Thread Matt Post
Hi Toshiki,

Have you been able to gather any parallel data?
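
If you do have some, a quick sanity check before sending it over is sketched 
below. It assumes the usual layout of two plain-text files with one sentence per 
line, where line N of one file is the translation of line N of the other; that 
layout, the function name check_parallel, and any file names you pass in are my 
assumptions for the example.

```shell
# Sketch only: verify that two line-aligned files have the same line count.
# check_parallel is a hypothetical helper; the file layout is an assumption.
check_parallel() {
  src_lines=$(wc -l < "$1" | tr -d ' ')   # strip padding some wc versions add
  tgt_lines=$(wc -l < "$2" | tr -d ' ')
  if [ "$src_lines" -eq "$tgt_lines" ]; then
    echo "ok: $src_lines sentence pairs"
  else
    echo "mismatch: $1 has $src_lines lines, $2 has $tgt_lines lines"
    return 1
  fi
}
```

A matching line count does not prove the sentences are aligned, but a mismatch 
is a sure sign that something went wrong during conversion.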

matt


> On Jul 22, 2016, at 3:50 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> 
> Hi Toshiki,
> 
> For this kind of discussion, let's have it in the dev@ list.
> 
> You can ask the question to dev@joshua.incubator.apache.org.
> 
> Thanks,
> 
> Henry
> 
> On Thu, Jul 21, 2016 at 9:46 PM, IGA Tosiki <igap...@gmail.com> wrote:
> 
>> Hi Matt,
>> 
>> Thanks for your reply!
>> 
>> I'm happy to read your mail. I want to help you build a Japanese-English
>> language pack.
>> And YES, I mean translation memories in TMS/XLIFF. But I can convert
>> TMS to whatever format you specify.
>> 
>> I also know that English to Japanese is very difficult, but I
>> believe a sample English-Japanese language pack will attract many
>> Japanese people to use Joshua.
>> 
>> Regards,
>> Toshiki
>> 
>> 2016-07-22 12:42 GMT+09:00 Matt Post <p...@cs.jhu.edu>:
>>> Hi,
>>> 
>>> There is no Japanese--English language pack, but I would be happy to
>>> build one if you could help by pointing me to data. What we need is
>>> parallel data in the form of sentences that are translations of each other.
>>> If you have access to this or pointers to where I could find some, I would
>>> be happy to build it. There are likely standard datasets available; people
>>> like Graham Neubig (http://www.phontron.com) have been working on this
>>> for a while.
>>> 
>>> What are TMS and LTIFF? Are you talking about translation memories?
>>> 
>>> As a side note, translation between English and Japanese is very
>>> difficult and tends not to be very good. One approach that helps is
>>> translating from trees and forests. Joshua does not have this capability at
>>> the moment.
>>> 
>>> Sincerely,
>>> matt
>>> 
>>> 
>>>> On Jul 21, 2016, at 11:28 PM, IGA Tosiki <igap...@gmail.com> wrote:
>>>> 
>>>> Hi team,
>>>> 
>>>> I have become interested in Joshua and its language packs. I am Japanese,
>>>> and I would like to know about a Japanese language pack.
>>>> 
>>>> Is there any plan to build a Japanese-English language pack?
>>>> I believe TMS or LTIFF would be useful for building such a language pack. I
>>>> have many OSS-based TMS between English and Japanese. Is there any path
>>>> for using TMX or LTIFF as input for a Joshua language pack?
>>>> 
>>>> Best regards,
>>>> Toshiki Iga
>>> 
>> 



Re: Issue Building LM on master branch

2016-07-17 Thread Matt Post
Lewis — This is a good-sized dataset, and on a single desktop machine, I expect 
it would take at least a day to go all the way through alignment, 
model-building, and tuning.

fast_align is a good idea, though it isn't integrated into the pipeline 
(shouldn't be too hard, and is on the list). You could also just try "--aligner 
berkeley" and see if that works. 

Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)? Sometimes 
GIZA doesn't compile correctly, and this could be an error where it doesn't 
find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
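
To rule out the missing-binary case quickly, something like the sketch below 
would do. The function name check_giza_binaries is made up, and you have to 
supply the directory yourself, since where the binaries end up depends on how 
GIZA was built on your machine.

```shell
# Sketch only: report which GIZA++ binaries are missing or not executable.
# check_giza_binaries is a hypothetical helper; the directory is an assumption.
check_giza_binaries() {
  bindir="$1"   # directory that should contain the compiled binaries
  missing=0
  for name in GIZA++ mkcls snt2cooc.out; do
    if [ ! -x "$bindir/$name" ]; then
      echo "missing: $bindir/$name"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]   # non-zero exit status if anything is missing
}
```

If all three are present and executable, the compile is probably fine and the 
error logs are the next place to look.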

matt


> On Jul 16, 2016, at 6:01 PM, Lewis John Mcgibbney  
> wrote:
> 
> Hi Folks,
> When attempting to build a hiero model using 5K sentences for tuning, many
> more than that for testing, and many more again for the actual corpus
> (~880K), I get the following error within the GIZA alignment
> pipeline phase.
> 
> Anyone have a clue what this means? I have the full GIZA logs if they are
> useful.
> I did find a thread on a VERY similar issue at [0]. The solution seems to
> be to use absolute paths to all input data for the pipeline however that is
> exactly what I've done e.g.
> 
> $JOSHUA/bin/pipeline.pl  --rundir . --type hiero --corpus
> /usr/local/joshua_input/commoncrawl.ru-en --tune
> /usr/local/joshua_input/commoncrawl.ru-en.tune --test
> /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
> --rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to
> English Translation model" --mbr
> 
> Where the parallel .en and .ru sentence files exist for all of the above
> corpus, tune and test paths respectively.
> 
> [0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
> 
> I have been having trouble consistently when generating models using
> GIZA... is there a suggested alignment substitute which I should be trying
> out?
> 
> One last question... roughly how long should building a Hiero-based LM for a
> corpus of ~880K sentences take on, say, a MacBook Pro with a 2.7GHz Intel
> Core i7 and 16GB of memory? I remember reading a while ago on the old Joshua
> site that a pipeline would run in 10 or so minutes... this is clearly not
> the case and I would like to share/compare some results if possible with
> others who are in the business of generating LMs and language packs.
> 
> Thanks
> 
> ==
> Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
> Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
> Combining forward and inverted alignment from files:
>  alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
>  alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
> Executing: bash -c mkdir -p alignments/0/model
> Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
> <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> skip=<0> counts=<817962>
> symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> bash: line 1:  9080 Done
> /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
>  9081 Abort trap: 6   |
> /usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> Exit code: 134
> ERROR: Can't generate symmetrized alignment file
> 
> 
> 
> -- 
> *Lewis*



Re: Avoiding master failures with CI

2016-07-18 Thread Matt Post
Thanks, done.

https://issues.apache.org/jira/browse/INFRA-12301?jql=project%20%3D%20INFRA


> On Jul 15, 2016, at 6:31 PM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Hey Matt,
> 
> Apache infra supports Travis CI - just file a ticket and they will
> set it up :)
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> On 7/15/16, 2:05 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
> 
>> Question for Chris and/or Lewis:
>> 
>> So, Kellen and I took a look at this today, and it looks like a good 
>> solution. The problem is that it integrates with projects hosted on Github 
>> that you have write access to. In order to make use of this, we'd need to 
>> rearrange the setup we have.
>> 
>> Currently, we push to a repo at git.apache.org, and that is then pushed down 
>> to github.com/apache/incubator-joshua. This lets us use the Github repo for 
>> receiving things like pull requests and so on, but we do not have write 
>> access to it, so merges and so on have to be handled manually.
>> 
>> To use Travis-CI, we'd need to re-engineer this. Apache would need to give 
>> us write access to github.com/apache/incubator-joshua, or we'd need to use 
>> another official host for Joshua. We'd then use git.apache.org as the 
>> mirror, instead of the other way around.
>> 
>> Is there any way that this could be done? I understand Apache's arguments 
>> about keeping discussions at home, since github may not last forever. 
>> However, it seems like we could do this if we use git.apache.org as the 
>> backup mirror, and continue to use JIRA for discussions and so on. In 
>> general, Github has a lot of tools that could help with development. It 
>> would be nice if we could make use of them while still checking off Apache's 
>> logging requirements.
>> 
>> matt
>> 
>> 
>> 
>>> On Jul 11, 2016, at 6:50 PM, kellen sunderland 
>>> <kellen.sunderl...@gmail.com> wrote:
>>> 
>>> Sorry should have provided the link to this page: https://travis-ci.org/ .
>>> If you scroll down a bit on that page there's a Pull Request flow section,
>>> it's the flow I'd be most in favour of.  There's also a decent (but rushed)
>>> demo here: https://www.youtube.com/watch?v=Uft5KBimzyk .  We actually don't
>>> need to do a lot of the work that he demos, i.e. no node or gulp
>>> configuration.  Our setup is close enough to a default Java project
>>> that we just have to tell it to build Java 8 and then it runs Maven
>>> properly.
>>> 
>>> Using a CI server would have some aspects that are similar to the branching
>>> document you mention, and some benefits that are a bit orthogonal.  Most of
>>> these benefits have to do with unit testing, which isn't covered in the doc.
>>> 
>>> First the orthogonal benefits:  The main benefit we would get from using CI
>>> is that we guarantee code in our repo is never broken.  That is to say
>>> tests always pass and it always builds correctly.  CI servers are really
>>> useful to prevent problems where one developer may have everything working
>>> properly on his/her machine, but later realizes it's not working
>>> on another dev's machine.  A good example of this is the class-based-lm-test
>>> we pushed recently.  It works fine for me locally but it would fail for
>>> anyone without kenlm.so.  There are many other examples (javadoc errors,
>>> code style, etc) but what will happen in these cases is we'll see a big
>>> obvious 'The build has problems' message in the PR page on Github.  If the
>>> CI server runs all of our code quality checks and finds that everything
>>> is good we'll get a big 'This PR is ready to merge' message.
>>> 
>>> Now to the part that overlaps a bit with branching.  There are various
>

Re: Russian Language Model for Joshua

2016-07-15 Thread Matt Post
All right, started trying to recompile. If you have a machine with > 256 GB of 
memory, it might be more efficient for me to give you the raw ARPA file and for 
you to compile it. We'll see how it goes. Ping me in a day if you don't hear 
from me.

matt


> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Yes please! :)
> 
> Sent from my iPhone
> 
>> On Jul 15, 2016, at 1:39 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM compiles 
>> of it failed in the past, but I'll try again. I expect it to be about 8 GB 
>> when that's done. Do you want it?
>> 
>> matt
>> 
>> 
>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980) 
>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>> 
>>> Hey Folks,
>>> 
>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
>>> one, not sure if he has it but just broadening the question.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++
>>> Director, Information Retrieval and Data Science Group (IRDS)
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> WWW: http://irds.usc.edu/
>>> ++
>> 



Re: Avoiding master failures with CI

2016-07-15 Thread Matt Post
Question for Chris and/or Lewis:

So, Kellen and I took a look at this today, and it looks like a good solution. 
The problem is that it integrates with projects hosted on Github that you have 
write access to. In order to make use of this, we'd need to rearrange the setup 
we have.

Currently, we push to a repo at git.apache.org, and that is then pushed down to 
github.com/apache/incubator-joshua. This lets us use the Github repo for 
receiving things like pull requests and so on, but we do not have write access 
to it, so merges and so on have to be handled manually.

To use Travis-CI, we'd need to re-engineer this. Apache would need to give us 
write access to github.com/apache/incubator-joshua, or we'd need to use another 
official host for Joshua. We'd then use git.apache.org as the mirror, instead 
of the other way around.

Is there any way that this could be done? I understand Apache's arguments about 
keeping discussions at home, since github may not last forever. However, it 
seems like we could do this if we use git.apache.org as the backup mirror, and 
continue to use JIRA for discussions and so on. In general, Github has a lot of 
tools that could help with development. It would be nice if we could make use 
of them while still checking off Apache's logging requirements.

matt



> On Jul 11, 2016, at 6:50 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Sorry should have provided the link to this page: https://travis-ci.org/ .
> If you scroll down a bit on that page there's a Pull Request flow section,
> it's the flow I'd be most in favour of.  There's also a decent (but rushed)
> demo here: https://www.youtube.com/watch?v=Uft5KBimzyk .  We actually don't
> need to do a lot of the work that he demos, i.e. no node or gulp
> configuration.  Our setup is close enough to a default Java project
> that we just have to tell it to build Java 8 and then it runs Maven
> properly.
> 
> Using a CI server would have some aspects that are similar to the branching
> document you mention, and some benefits that are a bit orthogonal.  Most of
> these benefits have to do with unit testing, which isn't covered in the doc.
> 
> First the orthogonal benefits:  The main benefit we would get from using CI
> is that we guarantee code in our repo is never broken.  That is to say
> tests always pass and it always builds correctly.  CI servers are really
> useful to prevent problems where one developer may have everything working
> properly on his/her machine, but later realizes it's not working
> on another dev's machine.  A good example of this is the class-based-lm-test
> we pushed recently.  It works fine for me locally but it would fail for
> anyone without kenlm.so.  There are many other examples (javadoc errors,
> code style, etc) but what will happen in these cases is we'll see a big
> obvious 'The build has problems' message in the PR page on Github.  If the
> CI server runs all of our code quality checks and finds that everything
> is good we'll get a big 'This PR is ready to merge' message.
> 
> Now to the part that overlaps a bit with branching.  There are various
> branching strategies that we could adopt for the project.  The master / dev
> branch one is a possibility.  I'd suggest we try to commit code strictly in
> PRs rather than pushing to git.  This would be the equivalent of feature
> branching from your link.  The reason I'd suggest that approach is that
> from what I've seen it'll be dead simple to get working with Github and
> Travis, and it gives us the same goal of having a stable master branch.
> 
> If you'd like we can walk through setting this up together on a forked
> version of our Github repo.  We could do a quick example of how code would
> be pushed and merged.  I should be available for a google hangout some time
> this week if that works for you?
> 
> -Kellen
> 
> 
> On Mon, Jul 11, 2016 at 10:51 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
>> CI = continuous integration :)
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> +

Re: joshua - Build # 49 - Still Failing!

2016-07-13 Thread Matt Post
Minor point, but are there any objections to changing this to 6.1-SNAPSHOT?

matt


> On Jul 13, 2016, at 8:45 AM, kellen sunderland  
> wrote:
> 
> Ahh, ok.  I guess I'll just keep an eye on it.  Thanks Tom (and thanks for
> doing the work to set this up).
> 
> On Wed, Jul 13, 2016 at 2:43 PM, Tom Barber  wrote:
> 
>> That one did, you are correct, but a few builds later it reverted back to
>> fine, so whatever it was was transient I guess.
>> 
>> --
>> 
>> Director Meteorite.bi - Saiku Analytics Founder
>> Tel: +44(0)5603641316
>> 
>> (Thanks to the Saiku community we reached our Kickstart
>> <
>> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
>>> 
>> goal, but you can always help by sponsoring the project
>> )
>> 
>> On 13 July 2016 at 13:41, kellen sunderland 
>> wrote:
>> 
>>> To me it reads as if it failed when trying to upload.
>>> 
>>> [WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
>>> 'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
>>> '1252f3767e96442e19af8fb760ed07156f4a70cc' - RETRYING
>>> [WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
>>> 'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
>>> '1252f3767e96442e19af8fb760ed07156f4a70cc' - IGNORING
>>> Uploading:
>>> 
>>> 
>> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
>>> 4/1118K
>>> 8/1118K
>>> 12/1118K
>>> 16/1118K
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 1116/1118K
>>> 1118/1118K
>>> [INFO]
>>> 
>>> [ERROR] BUILD ERROR
>>> [INFO]
>>> 
>>> [INFO] Error deploying artifact: Failed to transfer file:
>>> 
>>> 
>> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
>>> .
>>> Return code is: 401
>>> 
>>> 
>>> 
>>> -Kellen
>>> 
>>> 
>>> On Wed, Jul 13, 2016 at 2:24 PM, Tom Barber 
>>> wrote:
>>> 
 
 
>>> 
>> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/
 
>>>> Snapshots are uploading, it's just missing the version you're looking
>> for.
>>>>
>>>> On Wed, Jul 13, 2016 at 1:20 PM, kellen sunderland <
>>>> kellen.sunderl...@gmail.com> wrote:
>>>>
>>>>> Strange, https://builds.apache.org/job/joshua_master/78/ is passing.
>>>>> Looks
>>>>> like the analysis CI is getting a 401 when trying to upload a build
>>>>> artifact to here:
>>>>>
>>>>>
>>>>>
>>>>
>>>
>> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
>>>>>
>>>>>
>>>>> Anyone know who has admin access on this CI server?  I think we might
>>>> need
>>>>> to double check the auth settings for this step.
>>>>>
>>>>> -Kellen
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> joshua - Build # 49 - Still Failing:
>>>>>
>>>>>
>>>>> Check console output at
>>>> https://analysis.apache.org/jenkins/job/joshua/49/
>>>>> to view the results.
> 
 
>>> 
>> 



bigtranslate

2016-07-13 Thread Matt Post
Chris,

This looks cool. How are you planning to get this to work with Joshua? Do you 
need help with the API piece?

matt


> On Jul 12, 2016, at 6:40 PM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> I will see about registering as well :)
> 
> I have BigTranslate up and working if anyone is interested. I am
> currently evaluating it on the XDATA employment corpus with Lingo24
> but next is Joshua (and hoping to use Bing Translate too). If anyone
> has an Amazon unlimited key for translation to send my way would 
> love to add it to the mix too :)
> 
> http://github.com/chrismattmann/bigtranslate/
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 7/12/16, 5:12 PM, "kellen sunderland" <kellen.sunderl...@gmail.com> wrote:
> 
>> Thanks for forwarding Matt.  I think a fair number of people from my team
>> will want to attend.  I'll pass around the registration link.
>> 
>> -Kellen
>> On Jul 12, 2016 11:01 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> 
>>> Hi everyone,
>>> 
>>> We had talked a while ago about Joshua projects for MT Marathon in Prague.
>>> Registration (free) is now open. Let me know if you're planning to go and
>>> we can make some plans!
>>> 
>>> http://ufal.mff.cuni.cz/mtm16/registration
>>> 
>>> matt
>>> 
>>> 



Re: Russian Language Model for Joshua

2016-07-16 Thread Matt Post
Done:

http://cs.jhu.edu/~post/tmp/ru.kenlm
4106251755 bytes, sha1sum: 5c894e24dafa42bc44a5bb6822812d6234eda791

Let me know when you have it so I can delete it.
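
For anyone fetching the file, a quick integrity check against the size and
SHA-1 quoted above could look like the sketch below. The local path you pass
in (for example "ru.kenlm") is an assumption about where the download was
saved; the expected values are the ones from this message.

```python
import hashlib
import os

# Values quoted in the message above.
EXPECTED_SIZE = 4106251755
EXPECTED_SHA1 = "5c894e24dafa42bc44a5bb6822812d6234eda791"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading in chunks to keep memory flat."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_size=EXPECTED_SIZE,
                       expected_sha1=EXPECTED_SHA1):
    """Check the size first (cheap), then the checksum (slow on a 4 GB file)."""
    if os.path.getsize(path) != expected_size:
        return False
    return sha1_of_file(path) == expected_sha1
```

Calling download_is_intact("ru.kenlm") after the transfer finishes should
return True before the copy on the server is deleted.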

matt


> On Jul 15, 2016, at 4:42 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> All right, started trying to recompile. If you have a machine with > 256 GB 
> of memory, it might be more efficient for me to give you the raw ARPA file 
> and for you to compile it. We'll see how it goes. Ping me in a day if you 
> don't hear from me.
> 
> matt
> 
> 
>> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980) 
>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>> 
>> Yes please! :)
>> 
>> Sent from my iPhone
>> 
>>> On Jul 15, 2016, at 1:39 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM 
>>> compiles of it failed in the past, but I'll try again. I expect it to be 
>>> about 8 GB when that's done. Do you want it?
>>> 
>>> matt
>>> 
>>> 
>>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980) 
>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>> 
>>>> Hey Folks,
>>>> 
>>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
>>>> one, not sure if he has it but just broadening the question.
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattm...@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++
>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> WWW: http://irds.usc.edu/
>>>> ++
>>> 
> 



Re: Russian Language Model for Joshua

2016-07-15 Thread Matt Post
no worries I got it packed. will email later tonight. 

matt (from my phone)

> On Jul 15, 2016, at 6:32 PM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Will do.
> 
> Adding Paul Zimdars - do we have an Amazon machine that has > 256GB
> of memory? How much would that cost?
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>> On 7/15/16, 1:42 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> 
>> All right, started trying to recompile. If you have a machine with > 256 GB 
>> of memory, it might be more efficient for me to give you the raw ARPA file 
>> and for you to compile it. We'll see how it goes. Ping me in a day if you 
>> don't hear from me.
>> 
>> matt
>> 
>> 
>>> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980) 
>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>> 
>>> Yes please! :)
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Jul 15, 2016, at 1:39 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM 
>>>> compiles of it failed in the past, but I'll try again. I expect it to be 
>>>> about 8 GB when that's done. Do you want it?
>>>> 
>>>> matt
>>>> 
>>>> 
>>>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980) 
>>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>> 
>>>>> Hey Folks,
>>>>> 
>>>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
>>>>> one, not sure if he has it but just broadening the question.
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: chris.a.mattm...@nasa.gov
>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>> ++
>>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> WWW: http://irds.usc.edu/
>>>>> ++
>> 



Re: Website Branding Issues

2016-07-09 Thread Matt Post
Hi John,

I believe I just corrected this:

http://joshua.incubator.apache.org

However, I don't know who our TLP sponsor is, so the disclaimer is missing that 
portion of the notice. Can someone advise me here?

matt


> On Jul 9, 2016, at 11:18 AM, John D. Ament  wrote:
> 
> Ping.  When can this be expected to be resolved?
> 
> On 2016-07-01 17:55 (-0400), johndam...@apache.org wrote: 
>> Dear podling,
>> 
>> During a recent audit of podling websites, your podling was identified as 
>> not including the incubating disclaimer.  This disclaimer is required on all 
>> podling websites, releases and announcements, to clarify that your project 
>> may not be in compliance with all ASF processes.
>> 
>> Please review the Incubator's branding guide: 
>> http://incubator.apache.org/guides/branding.html
>> 
>> The full list of observations for all projects can be found at: 
>> https://wiki.apache.org/incubator/BrandingAuditJune2016
>> 
>> If you have any questions, feel free to respond to this email or email our 
>> general@ mailing list.  Please note that I am not subscribed to your mailing 
>> list.
>> 



Re: [DISCUSS] Joshua main Website redirect to wiki?

2016-07-09 Thread Matt Post
Yes, we could, it's just a pain. Is there a problem with the wiki being the 
main page? It needs work but being able to edit in place is very convenient. 
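
For reference, the build-elsewhere-and-push approach Tom suggests (quoted
below) might be scripted roughly like this. It is only a sketch: the source
directory, output directory, and publishing checkout are assumed names, and
the final commit/push step is left out because it depends on how the site is
actually published.

```python
import shutil
import subprocess


def jekyll_build_command(source_dir, output_dir):
    """Build the argument list for a local Jekyll build."""
    return ["jekyll", "build",
            "--source", source_dir,
            "--destination", output_dir]


def build_and_stage(source_dir, output_dir, publish_checkout):
    """Compile the site to static HTML, then copy it into a publishing checkout.

    All three paths are hypothetical; adjust them to the real layout.
    """
    subprocess.run(jekyll_build_command(source_dir, output_dir), check=True)
    # dirs_exist_ok requires Python 3.8+; existing files are overwritten.
    shutil.copytree(output_dir, publish_checkout, dirs_exist_ok=True)
```

This keeps Jekyll off the server entirely, at the cost of the extra manual
step that makes it "a pain" compared to editing the wiki in place.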

matt (from my phone)

> On Jul 9, 2016, at 5:45 PM, Tom Barber <t...@analytical-labs.com> wrote:
> 
> we could just compile the website elsewhere and push the result to the asf
> servers as I plan to do with a new oodt website. jekyll just compiles
> static html after all.
> 
> Tom
>> On 9 Jul 2016 22:42, "Matt Post" <p...@cs.jhu.edu> wrote:
>> 
>> I mentioned it a while back and no one objected, so I did it.
>> 
>> The issue is that the GitHub approach no longer worked because Apache does
>> not employ Jekyll server side, so there was a major impediment to editing
>> files.
>> 
>> I'm open to other options but this is very convenient!
>> 
>> matt (from my phone)
>> 
>>>> On Jul 9, 2016, at 5:31 PM, Henry Saputra <henry.sapu...@gmail.com>
>>> wrote:
>>> 
>>> HI All,
>>> 
>>> I just noticed that the main Joshua website [1] now redirects to the Wiki
>> [2].
>>> 
>>> Was there a discussion why we are doing it this way? I remember there
>> used
>>> to be HTML website for the main page.
>>> 
>>> Thanks,
>>> 
>>> Henry
>>> 
>>> [1] https://joshua.incubator.apache.org
>>> [2]
>> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
>> 
>> 



Re: [DISCUSS] Joshua main Website redirect to wiki?

2016-07-09 Thread Matt Post
I mentioned it a while back and no one objected, so I did it. 

The issue is that the GitHub approach no longer worked because Apache does not 
employ Jekyll server side, so there was a major impediment to editing files. 

I'm open to other options but this is very convenient!

matt (from my phone)

> On Jul 9, 2016, at 5:31 PM, Henry Saputra  wrote:
> 
> HI All,
> 
> I just noticed that the main Joshua website [1] now redirects to the Wiki [2].
> 
> Was there a discussion why we are doing it this way? I remember there used
> to be HTML website for the main page.
> 
> Thanks,
> 
> Henry
> 
> [1] https://joshua.incubator.apache.org
> [2]
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home



Re: [DISCUSS] Joshua main Website redirect to wiki?

2016-07-09 Thread Matt Post
I think the audit had to do with the lack of a disclaimer and the fact that we 
listed it as "Joshua" instead of "Apache Joshua". I fixed both of those. 

matt (from my phone)

> On Jul 9, 2016, at 5:53 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> 
> Hi Matt,
> 
> No objection from my side, I think I missed the discussion thread about it.
> I sincerely apologize.
> 
> I am looking at incubator website guide and don't see any rule about how
> the HTML is created.
> I asked John about the audit to see what he has to say.
> 
> 
> - Henry
> 
>> On Sat, Jul 9, 2016 at 2:42 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>> I mentioned it a while back and no one objected, so I did it.
>> 
>> The issue is that the GitHub approach no longer worked because Apache does
>> not employ Jekyll server side, so there was a major impediment to editing
>> files.
>> 
>> I'm open to other options but this is very convenient!
>> 
>> matt (from my phone)
>> 
>>>> On Jul 9, 2016, at 5:31 PM, Henry Saputra <henry.sapu...@gmail.com>
>>> wrote:
>>> 
>>> HI All,
>>> 
>>> I just noticed that the main Joshua website [1] now redirects to the Wiki
>> [2].
>>> 
>>> Was there a discussion why we are doing it this way? I remember there
>> used
>>> to be HTML website for the main page.
>>> 
>>> Thanks,
>>> 
>>> Henry
>>> 
>>> [1] https://joshua.incubator.apache.org
>>> [2]
>> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
>> 
>> 



Re: [DISCUSS] Joshua main Website redirect to wiki?

2016-07-10 Thread Matt Post
Done.



> On Jul 9, 2016, at 5:56 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> 
> Need to add the incubator logo [1] as part of website branding
> 
> 
> [1] http://incubator.apache.org/guides/branding.html
> 
> On Sat, Jul 9, 2016 at 2:55 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> I think the audit had to do with the lack of a disclaimer and the fact
>> that we listed it as "Joshua" instead of "Apache Joshua". I fixed both of
>> those.
>> 
>> matt (from my phone)
>> 
>>> On Jul 9, 2016, at 5:53 PM, Henry Saputra <henry.sapu...@gmail.com>
>> wrote:
>>> 
>>> Hi Matt,
>>> 
>>> No objection from my side, I think I missed the discussion thread about
>> it.
>>> I sincerely apologize.
>>> 
>>> I am looking at incubator website guide and don't see any rule about how
>>> the HTML is created.
>>> I asked John about the audit to see what he has to say.
>>> 
>>> 
>>> - Henry
>>> 
>>>> On Sat, Jul 9, 2016 at 2:42 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>> I mentioned it a while back and no one objected, so I did it.
>>>> 
>>>> The issue is that the GitHub approach no longer worked because Apache
>> does
>>>> not employ Jekyll server side, so there was a major impediment to
>> editing
>>>> files.
>>>> 
>>>> I'm open to other options but this is very convenient!
>>>> 
>>>> matt (from my phone)
>>>> 
>>>>>> On Jul 9, 2016, at 5:31 PM, Henry Saputra <henry.sapu...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> HI All,
>>>>> 
>>>>> I just noticed that the main Joshua website [1] now redirects to the Wiki
>>>> [2].
>>>>> 
>>>>> Was there a discussion why we are doing it this way? I remember there
>>>> used
>>>>> to be HTML website for the main page.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Henry
>>>>> 
>>>>> [1] https://joshua.incubator.apache.org
>>>>> [2]
>>>> 
>> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
>>>> 
>>>> 
>> 
>> 



Re: Parameters/weights

2016-06-30 Thread Matt Post

Hi Andrew,

Thanks for the note. The work building all the models advertised in the 
paper has fallen behind, but we hope to have it all resolved by the end of 
the month. Hopefully that will resolve some of the problems you have 
pointed out here, too. I have updated the PPDB page with a note.


I should note, though, that just like with general MT output, there is no 
guarantee that the output will be grammatical or well formed. There are 
also no guarantees about what kind of alternatives you'll see. So while I 
think the general paraphrasing results could be better, I am not surprised 
to see strange substitutions and lack of agreement.


matt



On Sat, 25 Jun 2016, Andrew Olney wrote:


I sent this to u...@joshua.incubator.apache.org and it bounced.

I'm trying to understand the behavior of Joshua English Paraphrase 
(apache-joshua-ppdb-2.0-s-all-v1) on this input:


I like cheddar cheese.

The top 10 paraphrases returned are:

I felt cheddar cheese.
I prefers cheddar cheese.
I considered cheddar cheese.
I lay cheddar cheese.
I feels cheddar cheese.
I appreciates cheddar cheese.
I wants cheddar cheese.
I includes cheddar cheese.
I enjoys cheddar cheese.
I wanted cheddar cheese.
I reaffirms cheddar cheese.
I pleased cheddar cheese.
I like cheddar cheese.

Some observations

1. Only the verb is changed. It seems that "cheddar cheese" could be 
paraphrased as "cheese" without much loss of information.


2. S/V agreement is not enforced in all cases. This is surprising.

I would like to better understand how to address 1 and 2.

Suggestions appreciated.

Best,

-Andrew



Re: Podling Report Reminder - February 2017

2017-02-01 Thread Matt Post
Folks,

I added the Joshua report.

https://wiki.apache.org/incubator/February2017 


It is due today. Feel free to make comments or initiate discussion here but 
otherwise what's there is what will be sent.

matt


> On Jan 25, 2017, at 7:21 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC



problems with BerkeleyLM

2017-02-01 Thread Matt Post
Hi folks,

I've found some problems with BerkeleyLM. I haven't diagnosed it yet, and am 
not going to have time for a week or two at least, but thought I'd bring it to 
everyone's attention because this affects our no-external-dependency releases.

As for the solution, in addition to trying to track down this problem, I've 
been working on a docker solution for helping people easily add KenLM to the 
language packs.

The problem can be seen in the following. I trained an English--German model, 
using the state-minimizing KenLM (KenLM/Full). You can see the BLEU scores on a 
number of test sets below. If I then swap out the StateMinimizingLanguageModel 
for a regular LanguageModel that still uses KenLM as the underlying 
representation (KenLM/LM), I get a drop, as expected. If I then swap out KenLM 
for BerkeleyLM, I get a further huge drop.

I wouldn't expect this large of a drop in either situation, but the BerkeleyLM 
one is especially troubling.

Anyway, troubleshooting is forthcoming, but I am sharing this in case anyone is 
using BerkeleyLM somewhere.

matt

---
news-test2008
KenLM/Full:   => BLEU = 0.1464
KenLM/LM: => BLEU = 0.1168
BerkeleyLM:   => BLEU = 0.0800

newstest2008-14.de-en
KenLM/Full:   => BLEU = 0.1524
KenLM/LM: => BLEU = 0.1235
BerkeleyLM:   => BLEU = 0.0839

newstest2009
KenLM/Full:   => BLEU = 0.1372
KenLM/LM: => BLEU = 0.1113
BerkeleyLM:   => BLEU = 0.0793

newstest2010
KenLM/Full:   => BLEU = 0.1487
KenLM/LM: => BLEU = 0.1213
BerkeleyLM:   => BLEU = 0.0847

newstest2011
KenLM/Full:   => BLEU = 0.1473
KenLM/LM: => BLEU = 0.1192
BerkeleyLM:   => BLEU = 0.0826

newstest2012
KenLM/Full:   => BLEU = 0.1488
KenLM/LM: => BLEU = 0.1205
BerkeleyLM:   => BLEU = 0.0797

newstest2013
KenLM/Full:   => BLEU = 0.1692
KenLM/LM: => BLEU = 0.1391
BerkeleyLM:   => BLEU = 0.0923

newstest2014.de-en
KenLM/Full:   => BLEU = 0.1669
KenLM/LM: => BLEU = 0.1351
BerkeleyLM:   => BLEU = 0.0881

newstest2016.de-en
KenLM/Full:   => BLEU = 0.2177
KenLM/LM: => BLEU = 0.1724
BerkeleyLM:   => BLEU = 0.1117
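
To put a number on the size of the drops, the scores above can be averaged
per configuration. The sketch below only restates the table and computes
means and relative drops; it says nothing about the cause, and the helper
names are illustrative.

```python
# BLEU scores copied from the table above:
# test set -> (KenLM/Full, KenLM/LM, BerkeleyLM)
SCORES = {
    "news-test2008":         (0.1464, 0.1168, 0.0800),
    "newstest2008-14.de-en": (0.1524, 0.1235, 0.0839),
    "newstest2009":          (0.1372, 0.1113, 0.0793),
    "newstest2010":          (0.1487, 0.1213, 0.0847),
    "newstest2011":          (0.1473, 0.1192, 0.0826),
    "newstest2012":          (0.1488, 0.1205, 0.0797),
    "newstest2013":          (0.1692, 0.1391, 0.0923),
    "newstest2014.de-en":    (0.1669, 0.1351, 0.0881),
    "newstest2016.de-en":    (0.2177, 0.1724, 0.1117),
}

CONFIGS = ("KenLM/Full", "KenLM/LM", "BerkeleyLM")


def mean_scores(scores):
    """Average BLEU per configuration across all test sets."""
    count = len(scores)
    return {
        name: sum(row[i] for row in scores.values()) / count
        for i, name in enumerate(CONFIGS)
    }


def relative_drop(baseline, value):
    """Fraction of the baseline score that was lost."""
    return (baseline - value) / baseline


if __name__ == "__main__":
    means = mean_scores(SCORES)
    for name in CONFIGS:
        print(f"{name}: mean BLEU = {means[name]:.4f}")
    print("Full -> LM drop: "
          f"{relative_drop(means['KenLM/Full'], means['KenLM/LM']):.1%}")
    print("LM -> BerkeleyLM drop: "
          f"{relative_drop(means['KenLM/LM'], means['BerkeleyLM']):.1%}")
```

Comparing the two relative drops side by side makes it easier to see how
much of the loss is specific to BerkeleyLM rather than to giving up state
minimization.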

Re: Podling Report Reminder - February 2017

2017-01-30 Thread Matt Post
Folks — I'll take care of this next week, after February 6.

matt

> On Jan 30, 2017, at 10:18 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC



Re: Cutting RC3

2017-02-23 Thread Matt Post
Thank you for heading this up, Tommaso! I'll be able to catch up on this after 
today.

matt


> On Feb 23, 2017, at 3:06 AM, Tommaso Teofili  
> wrote:
> 
> probably because of the mentioned network issues the artifacts ended up in
> two separate staging repositories in Nexus, which is undesired.
> I'll drop those repos, rollback the changes on the pom, delete the current
> tag in git and perform again mvn release:prepare / perform today.
> 
> Regards,
> Tommaso
> 
> Il giorno mer 22 feb 2017 alle ore 16:39 Tommaso Teofili <
> tommaso.teof...@gmail.com> ha scritto:
> 
>> Hi all,
>> 
>> Maven is in the extremely slow (because of my bandwidth) process of
>> deploying stuff on Nexus as part of the mvn release:perform phase.
>> In the meantime perhaps it is a good idea not to commit to the master branch,
>> until we get the RC3 voted and hence approved / rejected.
>> 
>> Thanks and regards,
>> Tommaso
>> 



Re: mvn assembly issues

2017-01-19 Thread Matt Post
I have never seen this error before! It seems like this must have something to 
do with the build environment where this is being done? Maybe there are tar 
options to not store the userid or to set it to something?
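
Some background on the number in the error quoted below: 2097151 is 8^7 - 1,
the largest value that fits in the seven octal digits of the uid field in a
classic ustar header, which is presumably why the posix mode Dave mentions
helps. The sketch below illustrates that limit using Python's tarfile module
as a stand-in; it is not the archiver the Maven assembly plugin uses, and the
file name is made up.

```python
import io
import tarfile

# Largest uid/gid a classic ustar header can hold (7 octal digits).
USTAR_MAX_ID = 8 ** 7 - 1


def can_archive(uid, tar_format):
    """Return True if an entry owned by `uid` can be written in `tar_format`."""
    info = tarfile.TarInfo(name="example.txt")  # hypothetical empty entry
    info.uid = uid
    info.size = 0
    buffer = io.BytesIO()
    try:
        with tarfile.open(fileobj=buffer, mode="w", format=tar_format) as tar:
            tar.addfile(info)
    except ValueError:
        # tarfile raises ValueError when a number overflows its header field.
        return False
    return True


if __name__ == "__main__":
    uid = 498339010  # the user id from the build failure quoted below
    print("ustar limit:", USTAR_MAX_ID)
    print("ustar:", can_archive(uid, tarfile.USTAR_FORMAT))
    print("pax:", can_archive(uid, tarfile.PAX_FORMAT))
```

The pax (posix) format stores oversized values in an extended header instead
of the fixed-width field, so large directory-service user ids no longer
overflow.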


> On Jan 18, 2017, at 9:08 PM, David Meikle  wrote:
> 
> Hey Lewis,
> 
>> On 18 Jan 2017, at 22:02, lewis john mcgibbney  wrote:
>> 
>> Hi Folks,
>> Anyone know how to work through this issue? The code in question can be
>> found at
>> https://github.com/apache/incubator-joshua/blob/master/pom.xml#L287-L309
>> Lewis
>> 
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 16.222 s
>> [INFO] Finished at: 2017-01-18T13:59:41-08:00
>> [INFO] Final Memory: 37M/639M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>> (source-release-assembly) on project joshua-incubating: Execution
>> source-release-assembly of goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user id
>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
>> 
>> -- 
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
> 
> 
> Normally switching tar to posix mode does the trick when I have had this 
> before, usually when logged into an AD domain on my Mac.  What does the full 
> log with -X say?
> 
> Cheers,
> Dave
> 



Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post
Karel — On this point, I don't think you should have to use the tutorials, 
which tell you how to identify training data and build new translation models 
yourself. I imagine that you would be more interested in downloading pre-built 
models that don't really require you to be an expert in MT. See this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

matt


> On Jan 17, 2017, at 12:07 PM, lewis john mcgibbney  wrote:
> 
> Hi Karel,
> The short answer is yes.
> I would advise you to start at the Tutorial
> https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started
> If you find anything which causes you problems then please write back here.
> Once you have skipped through the tutorial then you will have a much better
> feel for the workflow required.
> I can see the Apache Tika language identification and translate API's being
> of particular use here when considered in a runtime context. We have a
> Joshua implementation over in Tika which can aid you in this task however
> try the Joshua tutorial first.
> Lewis
> 
> On Mon, Jan 16, 2017 at 7:41 AM, Chris Mattmann  wrote:
> 
>> Hi Karel,
>> 
>> I would recommend moving this thread to dev@joshua.incubator.apache.org
>> instead of the private list. I’ve moved private to BCC.
>> 
>> Thank you.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 1/16/17, 6:58 AM, wrote:
>> 
>>Hello,
>> 
>>We would like to build a self-hosted machine translation system that
>>could be plugged into our mailman installs. The objective is that the
>>members of our multicultural network would be able to send email in
>>their mother language and it would be delivered to the list
>>machine-translated (and vice versa).
>> 
>>Are we on the right track with Joshua? I suppose that a lot of
>>configuration would be needed, but at this point I want to know if I am
>>not completely mistaken when considering your sw for this.
>> 
>>Thanks
>> 
>>karel
>> 
>> 
>>--
>>~~~
>>Karel Novotny
>>Knowledge Sharing & Network Development Coordinator
>>APC - The Association for Progressive Communications
>>https://www.apc.org
>>GSM: +420 605 243 246 (GMT +1)
>>jabber: ka...@riseup.net
>>Working/online: Monday - Thursday
>>~~~
>>My public OpenPGP key: https://pgp.mit.edu/pks/lookup?op=get=
>> 0x7FDEF502377E4FCA
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post

> On Jan 17, 2017, at 11:55 AM, Karel Novotný <ka...@apc.org> wrote:
> 
> Hello Matt,
> 
> Thanks for responding...
> 
> On 17.1.2017 17:31, Matt Post wrote:
>> Hello,
>> 
>> Joshua would be suitable to this. We have models built for FR→EN and ES→EN. 
>> I want to improve these because some certain data was left out. I could also 
>> build ones for the other direction.
> That's excellent news. Can you please tell me a bit more about what you
> mean by having models for FR→EN and ES→EN ? Does this mean that the tool
> is ready to be used by other applications (e.g. mailman) to auto-translate?
> 
> Have you had any previous experience with similar implementation as I
> described?

This just means we have pre-built models (which we call "language packs") that 
you can just download and immediately use to translate from French to English 
and from Spanish to English. For the complete list of language packs, along 
with instructions for how to use them, see this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

You can just download any of these, unpack them, and start translating. The 
quality will vary, but for these two languages should be reasonable.

To translate, the data you send to Joshua has to have already been 
sentence-split, because Joshua expects to receive input one sentence at a time. 
Joshua provides an API that you can make use of. Do you have any kind of 
expectations about your volume requirements? How many sentences will you be 
translating per day?
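
To illustrate the one-sentence-per-line requirement, here is a minimal sketch in Python. It is not part of Joshua: the function names are made up for this example, and the splitting rule (break after ., ! or ? followed by whitespace) is a naive assumption that will mis-split abbreviations such as "Dr.". A trained sentence splitter would do better.

```python
import re

# Naive boundary: whitespace that follows ., ! or ?
# Illustrative only; abbreviations like "Dr." will be split incorrectly.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences found in `text`, one string per sentence."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    if not flat:
        return []
    return [s for s in _BOUNDARY.split(flat) if s]


def to_decoder_input(text):
    """Format `text` as one sentence per line, ready to be piped to a decoder."""
    return "\n".join(split_sentences(text))
```

The output of to_decoder_input() could then be fed to the decoder line by line, however the decoder is being invoked.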

matt


>> 
>> One question — What do you mean about 3rd party services being 
>> "untrustworthy"?
> 
> We wish to auto-translate lists with private conversations, so we can
> not run those by systems where we don't know (don't have control of)
> what happens with the data. That's all, I didn't want to accuse anyone.

Oh, that makes perfect sense. For some reason I assumed you were translating 
public mailing lists, but if you're doing private ones, it is reasonable to 
want to keep the data entirely in-house.


> thanks
> 
> karel
> 
>> 
>> matt
>> 
>> 
>>> On Jan 16, 2017, at 12:27 PM, Karel Novotný <ka...@apc.org> wrote:
>>> 
>>> Hello developers,
>>> 
>>> I am new to this list, so missing a lot of background. Apologies
>>> beforehand for eventually dumb questions...
>>> 
>>> We would like to build a self-hosted machine translation system that
>>> could be plugged into our mailman installs. The objective is that the
>>> members of our multicultural network would be able to send email in
>>> their mother language and it would be delivered to the list
>>> machine-translated (and vice versa). The translation pairs we care about
>>> most are EN<->FR and EN<->ES
>>> 
>>> Our dream scenario is:
>>> 
>>> 1. A translator machine is installed on our server, so the messages
>>> don't need to be run through untrustworthy 3rd party services (googletrans)
>>> 2. Mailman (or similar) is connected to such a translator
>>> 3. Mailing list users can opt to receive messages sent to the mailing
>>> list in following format:
>>> 
>>> 
>>> Message body
>>> --
>>> Message body translated
>>> -
>>> 
>>> 4. Similarly, the system can be configured so that when receiving
>>> messages from specific senders the messages get translated from FR or ES
>>> into EN
>>> 
>>> Our default language used on lists is EN
>>> 
>>> Is Joshua relevant for this? Any previous experience with similar setup?
>>> I suppose that a lot of configuration would be needed, but at this point
>>> I want to know if I am not completely mistaken when considering your
>>> Joshua for this.
>>> 
>>> Thanks
>>> 
>>> karel
>>> 
>>> ---
>>> 
>>> -- 
>>> ~~~
>>> Karel Novotny 
>>> Knowledge Sharing & Network Development Coordinator
>>> APC - The Association for Progressive Communications 
>>> https://www.apc.org
>>> GSM: +420 605 243 246 (GMT +1)
>>> jabber: ka...@riseup.net
>>> Working/online: Monday - Thursday
>>> ~~~
>>> My public OpenPGP key: 
>>> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA
>>> 
>>> 
>> 
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA


Re: Plugging self-hosted Joshua into mailman?

2017-01-17 Thread Matt Post
Hello,

Joshua would be suitable for this. We have models built for FR→EN and ES→EN. I 
want to improve these because some data was left out. I could also 
build ones for the other direction.

One question — What do you mean about 3rd party services being "untrustworthy"?

matt


> On Jan 16, 2017, at 12:27 PM, Karel Novotný  wrote:
> 
> Hello developers,
> 
> I am new to this list, so missing a lot of background. Apologies
> beforehand for eventually dumb questions...
> 
> We would like to build a self-hosted machine translation system that
> could be plugged into our mailman installs. The objective is that the
> members of our multicultural network would be able to send email in
> their mother language and it would be delivered to the list
> machine-translated (and vice versa). The translation pairs we care about
> most are EN<->FR and EN<->ES
> 
> Our dream scenario is:
> 
> 1. A translator machine is installed on our server, so the messages
> don't need to be run through untrustworthy 3rd party services (googletrans)
> 2. Mailman (or similar) is connected to such a translator
> 3. Mailing list users can opt to receive messages sent to the mailing
> list in following format:
> 
> 
> Message body
> --
> Message body translated
> -
> 
> 4. Similarly, the system can be configured so that when receiving
> messages from specific senders the messages get translated from FR or ES
> into EN
> 
> Our default language used on lists is EN
> 
> Is Joshua relevant for this? Any previous experience with similar setup?
> I suppose that a lot of configuration would be needed, but at this point
> I want to know if I am not completely mistaken when considering your
> Joshua for this.
> 
> Thanks
> 
> karel
> 
> ---
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA
> 
> 



Re: Pluggable preprocessing and OpenNLP

2017-01-18 Thread Matt Post
Hi,

Sorry, what file format are you talking about? Can you point me to an example 
of the Moses file format? Is this just plain text, one sentence per line?

In general the Moses format is the standard, to the extent that there are any 
standards in MT (they are all mostly informal).

matt

PS. Are you on dev@joshua, or do I need to keep CC'ing you at your address?


> On Jan 16, 2017, at 5:42 PM, Joern Kottmann <kottm...@gmail.com> wrote:
> 
> Hello,
> 
> we came to the conclusion that it would make sense to add direct
> formats support for letsmt and moses files.
> 
> Here are our two issues:
> https://issues.apache.org/jira/browse/OPENNLP-938
> https://issues.apache.org/jira/browse/OPENNLP-939
> 
> Does it make sense for you if we support those formats?
> Did we miss an important format?
> 
> The training works quite fine, but it will take me a bit more time to
> get the evaluation to return something useful. The OpenNLP Sentence
> Detector can only split on end-of-sentence (eos) chars. And if there is
> a sentence without an eos char it gets treated as a mistake by the
> evaluation.
> 
> Do you have a specific language which would be good for testing for
> you?
> 
> The tokenizer can probably be trained as well, I saw a couple of tokenized
> data sets. Maybe that makes sense for you too.
> 
> Jörn
> 
> 
> 
> On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
>> Hi Jörn,
>> 
>> [Sent again without the picture since Apache rejects those,
>> unfortunately...]
>> 
>> You just need monolingual text, so I suggest downloading either the
>> tokenized or untokenized versions. Unfortunately, Opus doesn't make
>> it easy to provide direct links to individual languages. But do
>> this:
>> 
>> 1. Go to http://opus.lingfil.uu.se
>> 
>> 2. Choose de → en (or some other language pair)
>> 
>> 3. In the "mono" or "raw" columns (depending on whether you want
>> tokenized or untokenized text), click the language file for the
>> dataset you want.
>> 
>> matt
>> 
>> 
>>> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <kottm...@gmail.com>
>>> wrote:
>>> 
>>> Do you have a pointer to an actual file? Or download package?
>>> 
>>> Jörn
>>> 
>>> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili <tommaso.teofili@
>>> gmail.com
>>>> wrote:
>>>> I think the parallel corpuses are taken from [1], so we could
>>>> start with
>>>> training sentdetect for language packs at [2].
>>>> 
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> [1] : http://opus.lingfil.uu.se/
>>>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>>> 
>>>> On Mon, Jan 9, 2017 at 11:39, Joern Kottmann <kottmann@
>>>> gmail.com
>>>> wrote:
>>>> 
>>>>> Sorry, for late reply, can you point me to a link for the
>>>>> parallel
>>>> corpus?
>>>>> We might just want to add formats support for it to OpenNLP.
>>>>> 
>>>>> Do you use tokenize.pl for all languages or do you have
>>>>> language
>>>> specific
>>>>> heuristics?
>>>>> It would be great to have an additional more capable rule based
>>>>> tokenizer
>>>>> in OpenNLP.
>>>>> 
>>>>> The sentence splitter can be trained on a few thousand
>>>>> sentences or so, I
>>>>> think that will work out nicely.
>>>>> 
>>>>> Jörn
>>>>> 
>>>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu>
>>>>> wrote:
>>>>> 
>>>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottmann@gmai
>>>>>>> l.com>
>>>>> wrote:
>>>>>>> I am happy to support a bit with this, we can also see if
>>>>>>> things in
>>>>>> OpenNLP
>>>>>>> need to be changed to make this work smoothly.
>>>>>> 
>>>>>> Great!
>>>>>> 
>>>>>> 
>>>>>>> One challenge is to train OpenNLP on all the languages you
>>>>>>> support.
>>>> Do
>>>>>> you
>>>>>>> have training data that could be used to train the
>>>>>>> tokenizer and
>>>>> sentence
>>>>>>> detector?
>>>>>> 
>>>>>> For the sentence-splitter, I imagine you could make use of
>>>>>> the source
>>>>> side
>>>>>> of our parallel corpus, which has thousands to millions of
>>>>>> sentences,
>>>> one
>>>>>> per line.
>>>>>> 
>>>>>> For tokenization (and normalization), we don't typically
>>>>>> train models
>>>> but
>>>>>> instead use a set of manually developed heuristics, which may
>>>>>> or may
>>>> not
>>>>> be
>>>>>> sentence-specific. See
>>>>>> 
>>>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>>>> 
>>>>>> How much training data do you generally need for each task?
>>>>>> 
>>>>>> 
>>>>>>> Jörn
>>>>>>> 
>> 
>> 



Re: [DISCUSS] Release Apache Joshua 6.1

2016-08-15 Thread Matt Post
Lewis,

A number of us met in person here in Berlin to discuss this, and we hammered 
out a roadmap. The current state reflected on JIRA is correct. Kellen has a 
multithreading simplification that he'll push up, I have the changes listed 
below, and those will constitute most of what we want to get done for Joshua 
6.1, which we plan to release in September.

In the meantime we have also reorganized the code along multi-module support 
(an earlier github PR). This will serve as the basis for Joshua 7, which we'll 
create a new branch for.

I have tried to adjust the roadmap dates but don't seem to have permission. Can 
you:

- Change the release date of 6.1 to 9/15?
- Delete the Joshua 7 milestone?
- Rename Joshua 6.2 to Joshua 7?
- Set its release date to March 15, 2017?

Thanks,
matt


> On Aug 13, 2016, at 11:39 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> I just massaged the roadmap a bit. I couldn't figure out how to change the 
> dates, though. I think we are on track for a September 6.1 release. Can 
> someone change the target date for 6.1 to 9/1, and remove the target date for 
> 6.2 (which might end up being 7?)
> 
> The main things I want to do are:
> 
> - Make a moderate change to the way the phrase-based decoder works (this will 
> fix issues 278, 282, 284, and possibly 268)
> - Set up the language packs more formally, with official test sets we can 
> test against, and bundle them with a Joshua JAR so that there are no external 
> dependencies
> 
> matt
> 
> 
>> On Aug 13, 2016, at 3:07 AM, lewis john mcgibbney <lewi...@apache.org> wrote:
>> 
>> Hi dev@,
>> What is waiting on us releasing Joshua 6.1?
>> This is the projected roadmap...
>> https://issues.apache.org/jira/browse/JOSHUA/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
>> Can someone please take a look through and we can rebase on our move
>> towards a release?
>> Thanks
>> 
>> -- 
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
> 



Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-02-26 Thread Matt Post
Hi folks,

First, Tommaso, thank you for pulling this together!

I want to remind everyone that there's a checklist to go through before sending 
your +1. Here's from an email from Tom Barber a while back:

> Hello folks,
> 
> I see plenty of +1's going through the release vote,  which is great to see
> people taking an active role in getting the release shipped.
> 
> For those of you who are new to the ASF there are a bunch of requirements
> to sign off for a release which you can find here:
> 
> http://incubator.apache.org/guides/releasemanagement.html#check-list 
> 
> 
> My current concern is that people who are new to the incubator are +1'ing
> software for release without checking all or part of the release cycle. Whilst
> not mandatory, when you +1 a release please can you try to indicate what
> you've checked. The reason for this is, the tag Lewis has built off isn't
> the tip of master, so if you're basing your +1 on your day to day
> development and knowledge of the code base, that's not always what's
> shipped. Also in the branching process, it's possible merges or alterations
> were accidentally made that Lewis has missed (this is very unlikely I know
> but you know, code changes). Also people build software on different OS's,
> versions of OS's etc so just because it builds on Lewis's laptop doesn't
> mean it builds on mine, for example.
> 
> Also regarding licenses, disclaimers etc, people notice different things or
> interpret stuff differently. It's always possible that someone might miss a
> library etc so it's important multiple eyes run over the same stuff.
> 
> Cheers,
> 
> Tom

I'm hoping I'll have time to go through this tomorrow.

matt



> On Feb 25, 2017, at 2:41 AM, Tommaso Teofili  
> wrote:
> 
> Hi Folks,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #3.
> 
> We solved 36 issues:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319720=12335049
> 
> Git source tag (3447715b3aa0a48ed79465d80618bd5a2f7a7558):
> https://s.apache.org/XIxJ
> 
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1004
> 
> Source Release Artifacts:
> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
> 
> PGP release keys (signed using 891768A5):
> https://git1-us-west.apache.org/repos/asf?p=incubator-joshua.git;a=blob_plain;f=KEYS;h=aa18365bf5c8c8fb17b084f783a75c3a2460a98d;hb=HEAD
> 
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
> 
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
> 
> Regards,
> Tommaso



Re: [jira] [Commented] (JOSHUA-304) word-align.conf alignment template file not compatible with berkeley aligner

2016-08-24 Thread Matt Post
It didn't regenerate. Try wiping out your rundir and starting over. 

matt (from my phone)

> On Aug 24, 2016, at 4:08 PM, Lewis John McGibbney (JIRA)  
> wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435687#comment-15435687
>  ] 
> 
> Lewis John McGibbney commented on JOSHUA-304:
> -
> 
> [~post] unfortunately my local tests are still not coming up with anything 
> fruitful.
> {code}
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua(JOSHUA-304) $ 
> $JOSHUA/bin/pipeline.pl --type hiero --rundir 8 --readme "Baseline Hiero run 
> 8 --lm-gen berkeleylm --lm berkeleylm --aligner berkeley proposed bug fixed 
> in ../../scripts/training/paralign.pl" --source es --target en --lm-gen 
> berkeleylm --lm berkeleylm --aligner berkeley --corpus 
> $SPANISH/corpus/asr/callhome_train --corpus $SPANISH/corpus/asr/fisher_train 
> --tune  $SPANISH/corpus/asr/fisher_dev --test  
> $SPANISH/corpus/asr/callhome_devtest
> [train-copy-and-filter] cached, skipping...
> [train-tokenize-es] cached, skipping...
> [train-tokenize-en] cached, skipping...
> [train-trim] cached, skipping...
> [train-lowercase-es] cached, skipping...
> [train-lowercase-en] cached, skipping...
> [train-vocab-es] cached, skipping...
> [train-vocab-en] cached, skipping...
> [tune-copy-and-filter] cached, skipping...
> [tune-tokenize-es] cached, skipping...
> [tune-tokenize-en.0] cached, skipping...
> [tune-tokenize-en.1] cached, skipping...
> [tune-tokenize-en.2] cached, skipping...
> [tune-tokenize-en.3] cached, skipping...
> [tune-lowercase-es] cached, skipping...
> [tune-lowercase-en.0] cached, skipping...
> [tune-lowercase-en.1] cached, skipping...
> [tune-lowercase-en.2] cached, skipping...
> [tune-lowercase-en.3] cached, skipping...
> [tune-vocab-es] cached, skipping...
> [tune-vocab-en.0] cached, skipping...
> [tune-vocab-en.1] cached, skipping...
> [tune-vocab-en.2] cached, skipping...
> [tune-vocab-en.3] cached, skipping...
> [test-copy-and-filter] cached, skipping...
> [test-tokenize-es] cached, skipping...
> [test-tokenize-en] cached, skipping...
> [test-lowercase-es] cached, skipping...
> [test-lowercase-en] cached, skipping...
> [test-vocab-es] cached, skipping...
> [test-vocab-en] cached, skipping...
> [source-numlines] cached, skipping...
> [source-numlines] retrieved cached result =>   151810
> [berkeley-aligner-chunk-0] rebuilding...
>  dep=alignments/0/word-align.conf [CHANGED]
>  dep=/usr/local/incubator-joshua/8/data/train/splits/corpus.es.0 [NOT FOUND]
>  dep=/usr/local/incubator-joshua/8/data/train/splits/corpus.en.0 [NOT FOUND]
>  dep=alignments/0/training.align [NOT FOUND]
>  cmd=java -d64 -Xmx10g -jar 
> /usr/local/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar
>  ++alignments/0/word-align.conf
>  JOB FAILED (return code 1)
> [aligner-combine] rebuilding...
>  dep=alignments/0/training.en-es.align [NOT FOUND]
>  dep=alignments/training.align [CHANGED]
>  cmd=cat alignments/0/training.en-es.align > alignments/training.align
>  JOB FAILED (return code 1)
> cat: alignments/0/training.en-es.align: No such file or directory
> {code}
> 
>> word-align.conf alignment template file not compatible with berkeley aligner
>> 
>> 
>>Key: JOSHUA-304
>>URL: https://issues.apache.org/jira/browse/JOSHUA-304
>>Project: Joshua
>> Issue Type: Bug
>> Components: alignment, berkeley, templates
>>   Affects Versions: 6.0.5
>>   Reporter: Lewis John McGibbney
>>   Priority: Blocker
>>Fix For: 6.1
>> 
>> 
>> It took me quite some time to debug what was going on and why pipelines 
>> were failing when using the berkeley aligner.
>> It turns out that the word-align.conf template provided at
>> https://github.com/apache/incubator-joshua/blob/master/scripts/training/templates/alignment/word-align.conf
>> is not compatible with the berkeley aligner. 
>> In particular the following lines are not compatible
>> https://github.com/apache/incubator-joshua/blob/master/scripts/training/templates/alignment/word-align.conf#L12-L15
>> Evidence of this is provided below
>> {code}
>> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 
>> -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar 
>> ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
>> Invalid enum: 'MODEL1 HMM'; valid choices: MODEL1|MODEL2|HMM|SYNTACTIC|NONE
>> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 
>> -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar 
>> ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
>> Invalid enum: 'MODEL1, HMM'; valid choices: MODEL1|MODEL2|HMM|SYNTACTIC|NONE
>> 

Re: [GitHub] incubator-joshua issue #42: Fix various issues related to resources, warning...

2016-08-30 Thread Matt Post
Rebasing changes the history, so I think you can't do that with repos that have 
been pushed, right? In which case merge...

matt (from my phone)

> On Aug 30, 2016, at 3:41 PM, maxthomas  wrote:
> 
> Github user maxthomas commented on the issue:
> 
>https://github.com/apache/incubator-joshua/pull/42
> 
>do you want me to rebase off master, or merge? 
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---



roadmap

2016-09-30 Thread Matt Post
Hi folks,

Just a status update, since I / we are a bit behind: I'm in the process of 
putting together the first language pack, along with a script that will bundle 
it with the jar, a README describing its use and assembly, a CREDITS file 
describing the data used to build the model, and a BENCHMARK file listing the 
performance on test sets. All of these are being more-or-less automatically 
assembled, and I think they are important to include in the language packs.

Once I have a version of that put together, I'll post it for your review and 
testing. I hope to do this first thing next week. We can then move to do our 
first release. There are a number of small things we need to do (updating the 
CHANGELOG, site documentation, etc), but I think we're mostly ready.

A colleague here is also putting together a large number of language packs for 
lots of different languages. I'll have a list soon.

matt



Re: moses2 vs. joshua

2016-10-05 Thread Matt Post
Hi folks,

Sorry this took so long, long story. But the four models that Hieu shared with 
me are ready. You can download them here; they're each about 15–20 GB.

  http://cs.jhu.edu/~post/files/joshua-hiero-ar-en.tbz
  http://cs.jhu.edu/~post/files/joshua-phrase-ar-en.tbz
  http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
  http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz

It'd be great if someone could test them on a machine with lots of cores, to 
see how things scale.

matt

> On Sep 22, 2016, at 9:09 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Hi folks,
> 
> I have finished the comparison. Here you can find graphs for ar-en and ru-en. 
> The ground-up rewrite of Moses is about 2x–3x faster than Joshua.
> 
>   http://imgur.com/a/FcIbW
> 
> One implication (untested) is that we are likely as fast as or faster than 
> Moses.
> 
> We could brainstorm things to do to close this gap. I'd be much happier with 
> 2x or even 1.5x than with 3x, and I bet we could narrow this down. But I'd 
> like to get the 6.1 release out of the way, first, so I'm pushing this off to 
> next month. Sound cool?
> 
> matt
> 
> 
>> On Sep 19, 2016, at 6:26 AM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>> I can't believe I did this, but I mis-colored one of the hiero lines, and 
>> the Numbers legend doesn't show the line type. If you reload the dropbox 
>> file, it's fixed now. The difference is about 3x for both. Here's the table.
>> 
>> Threads  Joshua  Moses2  Joshua (hiero)  Moses2 (hiero)  Phrase rate  Hiero rate
>>       1     178      65            2116            1137         2.74        1.86
>>       2     109      42            1014             389         2.60        2.61
>>       4      78      29             596             213         2.69        2.80
>>       6      72      25             473             154         2.88        3.07
>> 
>> I'll put the models together and share them later today. This was on a 
>> 6-core machine and I agree it'd be nice to test with something much higher.
>> 
>> matt
>> 
>> 
>>> On Sep 19, 2016, at 5:33 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>> 
>>> Do we just want to store these models somewhere temporarily?  I've got a 
>>> OneDrive account and could share the models from there (as long as they're 
>>> below 500GBs or so).
>>> 
>>> On Mon, Sep 19, 2016 at 11:32 AM, kellen sunderland 
>>> <kellen.sunderl...@gmail.com> wrote:
>>> Very nice results.  I think getting to within 25% of a optimized c++ 
>>> decoder from a Java decoder is impressive.  Great that Hieu has put in the 
>>> work to make moses2 so fast as well, that gives organizations two quite 
>>> nice decoding engines to choose from, both with reasonable performance.
>>> 
>>> Matt: I had a question about the x axis here.  Is that number of threads?  
>>> We should be scaling more or less linearly with the number of threads, is 
>>> that the case here?  If you post the models somewhere I can also do a quick 
>>> benchmark on a machine with a few more cores. 
>>> 
>>> -Kellen
>>> 
>>> 
>>> On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili 
>>> <tommaso.teof...@gmail.com> wrote:
>>> On Sat, Sep 17, 2016 at 15:23, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> I'll ask Hieu; I don't anticipate any problems. One potential problem is
>>>> that the models occupy about 15--20 GB; do you think Jenkins would host
>>>> this?
>>>> 
>>> 
>>> I'm not sure, can such models be downloaded and pruned at runtime, or do
>>> they need to exist on the Jenkins machine ?
>>> 
>>> 
>

Re: language pack #1

2016-10-06 Thread Matt Post
Okay, I've fixed the nonbreaking_prefixes path issue.

The installation should now ignore your value of $JOSHUA entirely, preferring 
instead the bundled jar and scripts (maybe test this by unsetting $JOSHUA).

New version:

http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz

Please note: my tests show that using BerkeleyLM results in a notable drop in 
performance (1–2 BLEU points across many test sets). I am worried that we have 
introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that users don't 
have to compile KenLM, but we're probably going to need to provide the option 
to "upgrade" for those willing to try to compile it. Or we'll need a solution 
for distributing pre-built KenLM shared libraries...

matt



> On Oct 5, 2016, at 11:43 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> Quick further note -- I already had $JOSHUA set to a different directory,
> so initially all the lookups were failing.
> 
> It's possible current users of JOSHUA will as well when they download new
> language packs. This should be an obvious and quick fix for the user, but I
> don't know if there's something we could do in the name of making it even
> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
> the directory change in prepare.sh, and printing a warning if it's not?)
> 
> -John
> 
> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
>> Thanks, Matt!
>> 
>> Some notes:
>> 
>> When piping input into prepare.sh, I get the following output:
>> 
>> WARNING: No known abbreviations for language 'es', attempting fall-back to
>> English version...
>> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>> 
>> Seems that line 12 of tokenize.pl:
>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>> should be:
>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>> 
>> When I make this modification, it works just fine for me.
>> Also, tried in server mode -- seems to work without issue.
>> 
>> (For reference -- executed on an openSUSE cluster)
>> 
>> -John
>> 
>> 
>> 
>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Hi folks,
>>> 
>>> I have managed to assemble an actual working language pack. Consider this
>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>> languages. Please download it, check out the README and associated files,
>>> test it, and let me know what's missing or what needs to change.
>>> 
>>>    http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>> Suggested use:
>>> 
>>>tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>| ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>| ./apache-joshua-es-en-2016-10-05/joshua
>>> 
>>> matt
>> 
>> 
>> 



thrax problem

2016-10-07 Thread Matt Post
Hi folks,

I thought I'd let you know about a problem I discovered with Thrax. Can you 
spot it?

$ ls -lh grammar.gz
-rw-r--r-- 1 mpost staff 2.2G Oct  6 13:55 grammar.gz
$ gzip -cd 9/grammar.gz | cut -d\| -f4 | uniq -c | sort -n | tail
   8448  las 
   8643  a 
   9440  que 
   9595  se 
   9696  , 
  10617  los 
  10885  el 
  11687  en 
  11932  de 
  12738  la 

As you can see, for lots of source sides, there are tons of target options. The 
first time any rule is used, all the target sides are scored with 
estimateRule() in order to sort them (including a call to the LM), and then all 
but the top 20 (configurable with -num_translation_options) are discarded. This 
is a big waste: the useless rules are stored on disk, and while the 
compute-time waste is constant-time, it does make a difference in "warming up" 
the decoder and, of course, memory usage.

The problem is that Thrax takes all target sides it finds during training. It 
would be good to add an option to Thrax that only keeps the top X translation 
options for each source side (where X is maybe 100).
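
As a rough illustration of the pruning proposed above, here is a minimal sketch in Python. It is not Thrax code. It assumes each rule is one line with " ||| "-separated fields in the order LHS, source, target, features, and it takes the ranking function as a parameter, since the score Thrax would actually use is not settled here. The name prune_grammar is made up for this example.

```python
import heapq
from collections import defaultdict


def prune_grammar(lines, score, top_k=100):
    """Keep only the top_k highest-scoring rules for each source side.

    `lines` is an iterable of rule strings, assumed to have " ||| "-separated
    fields: LHS, source, target, features.
    `score` maps a rule string to a number (higher is better); it is a
    stand-in for whatever ranking the extractor would really use.
    """
    by_source = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        source = line.split(" ||| ")[1]
        by_source[source].append(line)

    kept = []
    for rules in by_source.values():
        # nlargest returns the rules in descending score order
        kept.extend(heapq.nlargest(top_k, rules, key=score))
    return kept
```

This version holds the whole grammar in memory, which is fine for illustration but not for a multi-gigabyte grammar. If the grammar is already sorted by source side, the same idea can be applied group by group while streaming.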

matt




Re: language pack #1

2016-10-07 Thread Matt Post
That would be awesome.

matt


> On Oct 7, 2016, at 11:49 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> I was actually going to try and build KenLM into a maven package that can
> be easily distributed.  I haven't had time to work on it too much but I
> think it shouldn't be too hard.
> 
> On Thu, Oct 6, 2016 at 4:16 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Okay, I've fixed the nonbreaking_prefixes path issue.
>> 
>> The installation should now ignore your value of $JOSHUA entirely,
>> preferring instead the bundled jar and scripts (maybe test this by
>> unsetting $JOSHUA).
>> 
>> New version:
>> 
>>    http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz
>> 
>> Please note: my tests show that using BerkeleyLM results in a notable drop
>> in performance (1–2 BLEU points across many test sets). I am worried that
>> we have introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that
>> users don't have to compile KenLM, but we're probably going to need to
>> provide the option to "upgrade" for those willing to try to compile it. Or
>> we'll need a solution for distributing pre-built KenLM shared libraries...
>> 
>> matt
>> 
>> 
>> 
>>> On Oct 5, 2016, at 11:43 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>>> 
>>> Quick further note -- I already had $JOSHUA set to a different directory,
>>> so initially all the lookups were failing.
>>> 
>>> It's possible current users of JOSHUA will as well when they download new
>>> language packs. This should be an obvious and quick fix for the user,
>> but I
>>> don't know if there's something we could do in the name of making it even
>>> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
>>> the directory change in prepare.sh, and printing a warning if it's not?)
>>> 
>>> -John
>>> 
>>> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt <john...@seas.upenn.edu>
>> wrote:
>>> 
>>>> Thanks, Matt!
>>>> 
>>>> Some notes:
>>>> 
>>>> When piping input into prepare.sh, I get the following output:
>>>> 
>>>> WARNING: No known abbreviations for language 'es', attempting fall-back to
>>>> English version...
>>>> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>>>> 
>>>> Seems that line 12 of tokenize.pl:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>>>> should be:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>>>> 
>>>> When I make this modification, it works just fine for me.
>>>> Also, tried in server mode -- seems to work without issue.
>>>> 
>>>> (For reference -- executed on an openSUSE cluster)
>>>> 
>>>> -John
>>>> 
>>>> 
>>>> 
>>>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> I have managed to assemble an actual working language pack. Consider this
>>>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>>>> languages. Please download it, check out the README and associated files,
>>>>> test it, and let me know what's missing or what needs to change.
>>>>> 
>>>>>   http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>>>> 
>>>>> Suggested use:
>>>>> 
>>>>>   tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>>>   echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>>>   | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>>>   | ./apache-joshua-es-en-2016-10-05/joshua
>>>>> 
>>>>> matt
>>>> 
>>>> 
>>>> 
>> 
>> 
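
A minimal sketch of the check John suggests above (warn when $JOSHUA points somewhere other than the unpacked language pack). It is written in Python for illustration only; prepare.sh is a shell script whose contents are not shown in this thread, and the function name and warning text are hypothetical.

```python
import os


def joshua_env_warning(pack_dir, environ=None):
    """Return a warning string if $JOSHUA disagrees with pack_dir, else None.

    pack_dir is the directory the language pack was unpacked into.
    environ defaults to the real process environment.
    """
    environ = os.environ if environ is None else environ
    joshua = environ.get("JOSHUA")
    if not joshua:
        # Unset or empty: nothing can conflict with the bundled files.
        return None
    if os.path.realpath(joshua) == os.path.realpath(pack_dir):
        return None
    return (
        "WARNING: $JOSHUA is set to %s, but this language pack lives in %s"
        % (joshua, pack_dir)
    )
```

A caller would print the returned string to stderr when it is not None and carry on, since the newer pack is described as ignoring $JOSHUA anyway.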



Re: moses2 vs. joshua

2016-09-22 Thread Matt Post
Hi folks,

I have finished the comparison. Here you can find graphs for ar-en and ru-en. 
The ground-up rewrite of Moses (Moses2) is about 2x–3x faster than Joshua.

http://imgur.com/a/FcIbW

One implication (untested) is that we are likely as fast as or faster than 
the original Moses.

We could brainstorm things to do to close this gap. I'd be much happier with 2x 
or even 1.5x than with 3x, and I bet we could narrow this down. But I'd like to 
get the 6.1 release out of the way, first, so I'm pushing this off to next 
month. Sound cool?

matt
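
For reference, the "rate" columns in the table quoted further down this thread can be reproduced from the raw numbers. This is a small sketch, assuming the Joshua and Moses2 columns are run times (lower is better; the units are not stated in the thread) and that each rate is Joshua divided by Moses2. The variable and function names are illustrative.

```python
# Timings copied from the quoted table: threads -> (Joshua, Moses2).
PHRASE = {1: (178, 65), 2: (109, 42), 4: (78, 29), 6: (72, 25)}
HIERO = {1: (2116, 1137), 2: (1014, 389), 4: (596, 213), 6: (473, 154)}


def slowdown(timings):
    """Ratio of Joshua time to Moses2 time for each thread count."""
    return {threads: round(j / m, 2) for threads, (j, m) in timings.items()}


def thread_scaling(timings, system):
    """Speedup of one system relative to its own single-thread time.

    system is 0 for Joshua and 1 for Moses2.
    """
    base = timings[1][system]
    return {threads: round(base / row[system], 2) for threads, row in timings.items()}


if __name__ == "__main__":
    print("phrase:", slowdown(PHRASE))
    print("hiero: ", slowdown(HIERO))
    print("Joshua phrase scaling:", thread_scaling(PHRASE, 0))
```

The second function addresses Kellen's question about scaling with thread count: under the same assumptions, it shows how far each system is from linear speedup on the 6-core machine.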


> On Sep 19, 2016, at 6:26 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> I can't believe I did this, but I mis-colored one of the hiero lines, and the 
> Numbers legend doesn't show the line type. If you reload the dropbox file, 
> it's fixed now. The difference is about 3x for both. Here's the table.
> 
> Threads  Joshua  Moses2  Joshua (hiero)  Moses2 (hiero)  Phrase rate  Hiero rate
> 1        178     65      2116            1137            2.74         1.86
> 2        109     42      1014            389             2.60         2.61
> 4        78      29      596             213             2.69         2.80
> 6        72      25      473             154             2.88         3.07
> 
> I'll put the models together and share them later today. This was on a 6-core 
> machine and I agree it'd be nice to test with something much higher.
> 
> matt
> 
> 
>> On Sep 19, 2016, at 5:33 AM, kellen sunderland
>> <kellen.sunderl...@gmail.com> wrote:
>> 
>> Do we just want to store these models somewhere temporarily?  I've got a 
>> OneDrive account and could share the models from there (as long as they're 
>> below 500 GB or so).
>> 
>> On Mon, Sep 19, 2016 at 11:32 AM, kellen sunderland 
>> <kellen.sunderl...@gmail.com> wrote:
>> Very nice results.  I think getting to within 25% of an optimized C++ decoder 
>> from a Java decoder is impressive.  Great that Hieu has put in the work to 
>> make moses2 so fast as well; that gives organizations two quite nice 
>> decoding engines to choose from, both with reasonable performance.
>> 
>> Matt: I had a question about the x axis here.  Is that number of threads?  
>> We should be scaling more or less linearly with the number of threads, is 
>> that the case here?  If you post the models somewhere I can also do a quick 
>> benchmark on a machine with a few more cores. 
>> 
>> -Kellen
>> 
>> 
>> On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili
>> <tommaso.teof...@gmail.com> wrote:
>> On Sat, Sep 17, 2016 at 3:23 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> I'll ask Hieu; I don't anticipate any problems. One potential problem is
>>> that the models occupy about 15--20 GB; do you think Jenkins would host
>>> this?
>>> 
>> 
>> I'm not sure, can such models be downloaded and pruned at runtime, or do
>> they need to exist on the Jenkins machine ?
>> 
>> 
>>> 
>>> (ru-en grammars still packing, results will probably not be in until much
>>> later today)
>>> 
>>> matt
>>> 
>>> 
>>>> On Sep 17, 2016, at 3:19 PM, Tommaso Teofili
>>>> <tommaso.teof...@gmail.com> wrote:
>>>> 
>>>> Hi Matt,
>>>> 
>>>> I think it'd be really valuable if we were able to repeat the same
>>>> tests (given a parallel corpus is available) in the future. Any chance
>>>> you can share the script / code to do that? We may even consider adding
>>>> a Jenkins job dedicated to continuously monitoring performance as we
>>>> work on the Joshua master branch.
>>>> 
>>>> WDYT?
>>>> 
>>>> Anyway thanks for sharing the very interesting comparisons.
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> On Sat, Sep 17, 2016 at 12:29 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>>> Ugh, I think the mailing list deleted the attachment. Here is an attempt
>>>>> around our censors:
>>>>> 
>>>>> https://www.dropbox.com/s/80up63reu4q809y/ar-en-joshua-moses2.png?dl=0
>>>>> 
>>>>> 
>>>>>> On Sep 17, 2016, a

Re: Build failed in Jenkins: joshua_master #96

2016-08-24 Thread Matt Post
We are running out of space on builds...
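
A build like the one quoted below fails late, inside a test, when the disk is already full. A minimal sketch of a pre-build guard, assuming one wanted to fail fast instead; the threshold and function name are hypothetical and nothing like this is known to exist in the Jenkins job.

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding path has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    # 2 GiB is an arbitrary illustrative threshold, not a measured requirement.
    required = 2 * 2**30
    if not has_free_space(".", required):
        raise SystemExit("Not enough free disk space in the workspace; aborting build.")
```

The JVM warning in the log also points at the temp directory, so the same check could be run against whatever directory java.io.tmpdir resolves to.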


> On Aug 23, 2016, at 10:15 PM, Apache Jenkins Server 
>  wrote:
> 
> See 
> 
> Changes:
> 
> [lewis.mcgibbney] Update examples README formatting and links.
> 
> [lewis.mcgibbney] Update examples README pipeline invocation parameters
> 
> --
> Started by an SCM change
> [EnvInject] - Loading node environment variables.
> Building remotely on ubuntu-us1 (Ubuntu golang-ppa ubuntu-us ubuntu) in 
> workspace 
>> git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>> git config remote.origin.url 
>> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git # timeout=10
> Fetching upstream changes from 
> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git
>> git --version # timeout=10
>> git -c core.askpass=true fetch --tags --progress 
>> https://git-wip-us.apache.org/repos/asf/incubator-joshua.git 
>> +refs/heads/*:refs/remotes/origin/*
>> git rev-parse refs/remotes/origin/master^{commit} # timeout=10
>> git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
> Checking out Revision 0744ebf56906dbe70292737cd50a39652407869d 
> (refs/remotes/origin/master)
>> git config core.sparsecheckout # timeout=10
>> git checkout -f 0744ebf56906dbe70292737cd50a39652407869d
>> git rev-list ff410c297a149400db3cb553b11a930ad01dc7ed # timeout=10
> [joshua_master] $ /home/jenkins/tools/maven/latest3/bin/mvn clean install 
> javadoc:aggregate
> Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared 
> memory file:
>   26586
> Try using the -Djava.io.tmpdir= option to select an alternate temp location.
> 
> [INFO] Scanning for projects...
> [INFO]
>  
> [INFO] 
> 
> [INFO] Building Apache Joshua Machine Translation Toolkit 6.1-SNAPSHOT
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ joshua ---
> [INFO] Deleting 
> [INFO] 
> [INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ joshua ---
> [INFO] 
> [INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ joshua 
> ---
> [debug] execute contextualize
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 1 resource
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ joshua ---
> [INFO] Compiling 266 source files to 
> 
> [INFO] 
> [INFO] --- maven-resources-plugin:2.5:testResources (default-testResources) @ 
> joshua ---
> [debug] execute contextualize
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 349 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ 
> joshua ---
> [INFO] Compiling 38 source files to 
> 
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ joshua ---
> 
> ---
> T E S T S
> ---
> Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared 
> memory file:
>   27154
> Try using the -Djava.io.tmpdir= option to select an alternate temp location.
> Running TestSuite
> 102030405060708090.100%
> ERROR - Can't find libken.so (libken.dylib on OS X) on the Java library path.
> tm_pt_0=-2.000 tm_glue_0=3.000 lm_0=-206.718 lm_0_oov=2.000 
> OOVPenalty=-200.000 | -198.000
> ERROR - Can't find libken.so (libken.dylib on OS X) on the Java library path.
> ERROR - Can't find libken.so (libken.dylib on OS X) on the Java library path.
> ERROR - Can't find libken.so (libken.dylib on OS X) on the Java library path.
> Tests run: 48, Failures: 1, Errors: 0, Skipped: 9, Time elapsed: 4.219 sec 
> <<< FAILURE! - in TestSuite
> externalizeVocabulary(org.apache.joshua.util.io.BinaryTest)  Time elapsed: 
> 0.01 sec  <<< FAILURE!
> java.io.IOException: No space left on device
>   at 
> org.apache.joshua.util.io.BinaryTest.externalizeVocabulary(BinaryTest.java:56)
> 
> 
> Results :
> 
> Failed tests: 
>  BinaryTest.externalizeVocabulary:56 » IO No space left on device
> 
> Tests run: 46, Failures: 1, Errors: 0, Skipped: 7
> 
> [INFO]
>  
> [INFO] 
> 
> [INFO] 

Re: [jira] [Commented] (JOSHUA-291) Improve code quality via static analysis

2016-09-28 Thread Matt Post
Great, thanks for sending the email.


> On Sep 28, 2016, at 5:13 AM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> it's fixed now [1].
> 
> Regards,
> Tommaso
> 
> [1] : https://github.com/apache/incubator-joshua
> 
> On Tue, Sep 27, 2016 at 9:15 PM, Tommaso Teofili
> <tommaso.teof...@gmail.com> wrote:
> 
>> Right, that's weird, it should sync automatically.
>> I've noticed that also other mirrors at apache have the same issue, maybe
>> it's related to a failure on github mirroring from yesterday, as you can
>> see from [1].
>> 
>> I'd opt for waiting a few more hours, then I'd ask infra@.
>> 
>> Regards,
>> Tommaso
>> 
>> [1] : http://status.apache.org/
>> 
>> 
>> On Tue, Sep 27, 2016 at 8:38 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Tommaso, this looks great. One problem, though, is that while these are
>>> present on Apache master, they are for some reason not being mirrored to
>>> Github. Anyone have any idea why?
>>> 
>>> 
>>> https://git-wip-us.apache.org/repos/asf?p=incubator-joshua.git;a=summary
>>> 
>>> The last commit to Github is from 12 days ago:
>>> 
>>>https://github.com/apache/incubator-joshua/
>>> 
>>> This is supposed to be mirrored automatically, I wonder what's up?
>>> 
>>> matt
>>> 
>>> 
>>> 
>>>> On Sep 26, 2016, at 8:30 AM, Tommaso Teofili (JIRA) <j...@apache.org>
>>> wrote:
>>>> 
>>>> 
>>>>   [
>>> https://issues.apache.org/jira/browse/JOSHUA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522917#comment-15522917
>>> ]
>>>> 
>>>> Tommaso Teofili commented on JOSHUA-291:
>>>> 
>>>> 
>>>> I've applied a bunch of improvements to the codebase based on static
>>> code analysis, marking as resolved for now.
>>>> 
>>>>> Improve code quality via static analysis
>>>>> 
>>>>> 
>>>>>   Key: JOSHUA-291
>>>>>   URL: https://issues.apache.org/jira/browse/JOSHUA-291
>>>>>   Project: Joshua
>>>>>Issue Type: Improvement
>>>>>Components: core
>>>>>  Reporter: Tommaso Teofili
>>>>>  Assignee: Tommaso Teofili
>>>>>   Fix For: 6.1
>>>>> 
>>>>> 
>>>>> We can improve code quality / readability leveraging code analysis
>>> from tools like FindBugs and others integrated in IDEs.
>>>> 
>>>> 
>>>> 
>>>> --
>>>> This message was sent by Atlassian JIRA
>>>> (v6.3.4#6332)
>>> 
>>> 



Re: openjdk 8 incompatibility

2016-10-25 Thread Matt Post
Hmm, inclusion of that line looks like a mistake. I've seen Eclipse add random 
imports because it sorts the suggestions in a very unhelpful manner. I just 
removed the line and pushed, try again.
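
Stray IDE-inserted imports like this one are easy to catch mechanically. The following is a rough sketch, assuming a plain scan of the source tree for javafx imports is enough; the function name is hypothetical and this is not part of the Joshua build.

```python
import re
from pathlib import Path

# Matches "import javafx...." and "import static javafx...." at the start of a line.
JAVAFX_IMPORT = re.compile(r"^\s*import\s+(static\s+)?javafx\.")


def find_javafx_imports(source_root):
    """Return (path, line number, line) for every javafx import under source_root."""
    hits = []
    for path in sorted(Path(source_root).rglob("*.java")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if JAVAFX_IMPORT.match(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_javafx_imports("src/main/java"):
        print("%s:%d: %s" % (path, lineno, line))
```

Run from a checkout root, an empty result would suggest the tree no longer depends on JavaFX classes that OpenJDK does not ship.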


> On Oct 25, 2016, at 1:11 PM, John Hewitt  wrote:
> 
> Hi all,
> 
> Has anyone been able to compile Joshua with openjdk? I get this message:
> 
> /home/john/java/incubator-joshua/src/main/java/org/apache/joshua/decoder/ff/lm/KenLM.java:[21,19]
> error: package javafx.scene does not exist
> 
> And the following link seems to confirm that javafx is not a part of
> openjdk.
> https://ask.fedoraproject.org/en/question/93407/there-is-no-javafx-packages-in-openjdk-180-fedora-gnulinux/
> 
> -John



Re: [jira] [Created] (JOSHUA-320) --joshua-mem pipeline parameter is not populated to mert processes

2016-10-27 Thread Matt Post
Hi Lewis,

You are confusing two things.

MERT calls Joshua, and passes it however much memory you set with --joshua-mem. 
It does this by writing (see pipeline.pl line 1550) 
$tunedir/decoder_command, which is what Z-MERT calls to run Joshua.

Z-MERT is itself a Java program that also gets 4g. There is no option to change 
this and I don't think there needs to be, although if you disagree, it wouldn't 
hurt to add it.
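
To make the distinction concrete, here is a sketch of the two independent heap settings described above: the decoder's heap, which follows --joshua-mem, and the tuner's own heap, which is fixed at 4g. The exact contents of decoder_command are not shown in this thread, so the command strings below are placeholders; only the JVM's -Xmx flag is taken as given.

```python
ZMERT_MEM = "4g"  # fixed tuner heap, per the message above


def decoder_command_text(joshua_mem, decoder_args):
    """Text of a hypothetical decoder_command: the decoder JVM gets --joshua-mem."""
    return "java -Xmx{mem} {args}\n".format(mem=joshua_mem, args=decoder_args)


def tuner_command_text(tuner_args, tuner_mem=ZMERT_MEM):
    """Text of a hypothetical tuner invocation: a separate JVM with its own heap."""
    return "java -Xmx{mem} {args}\n".format(mem=tuner_mem, args=tuner_args)


if __name__ == "__main__":
    # "<decoder args>" and "<tuner args>" stand in for the real arguments.
    print(decoder_command_text("8g", "<decoder args>"), end="")
    print(tuner_command_text("<tuner args>"), end="")
```

Raising --joshua-mem changes only the first command; adding the option Matt mentions would amount to exposing the second function's tuner_mem parameter.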


> On Oct 27, 2016, at 3:21 PM, Lewis John McGibbney (JIRA)  
> wrote:
> 
> Lewis John McGibbney created JOSHUA-320:
> ---
> 
> Summary: --joshua-mem pipeline parameter is not populated to mert 
> processes
> Key: JOSHUA-320
> URL: https://issues.apache.org/jira/browse/JOSHUA-320
> Project: Joshua
>  Issue Type: Bug
>  Components: mert, pipeline
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 6.2
> 
> 
> As we've discussed on the Joshua mailing list at 
> http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg01765.html
> it is not realistic to reserve only 4g for several tasks which are executed 
> as part of a typical pipeline run.
> In particular, MERT runs with 4g which is not enough. We should increase this 
> to something like 8g or more.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)



Re: [jira] [Closed] (JOSHUA-100) Add Shen et al. (2008) dependency LM

2016-10-27 Thread Matt Post
Lewis — why are you marking these as fixed? This is better classified as 
dropped or no longer needed.



> On Oct 26, 2016, at 3:28 AM, Lewis John McGibbney (JIRA) <j...@apache.org> 
> wrote:
> 
> 
> [ 
> https://issues.apache.org/jira/browse/JOSHUA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
> 
> Lewis John McGibbney closed JOSHUA-100.
> ---
>Resolution: Fixed
> 
>> Add Shen et al. (2008) dependency LM
>> 
>> 
>>Key: JOSHUA-100
>>URL: https://issues.apache.org/jira/browse/JOSHUA-100
>>Project: Joshua
>> Issue Type: New Feature
>>   Reporter: Matt Post
>>   Assignee: Matt Post
>>Fix For: 6.1
>> 
>> 
> 
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)



Re: Pipeline Mystery

2016-10-27 Thread Matt Post
Yes, MERT must be dying. Can you post the contents of the tune/ directory and 
the tail of mert.log?

matt (from my phone)
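
The two things requested above can be gathered in one go. This is a minimal sketch, assuming mert.log sits inside the tune/ directory (the thread does not say where it is written) and that the tuner's final output is tune/joshua.config.final as in the log below; the function name is hypothetical.

```python
import os
from collections import deque


def tuning_report(tune_dir, log_name="mert.log", tail_lines=20):
    """Collect the directory listing, final-config status and log tail for tune_dir."""
    final_config = os.path.join(tune_dir, "joshua.config.final")
    log_path = os.path.join(tune_dir, log_name)
    tail = []
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8", errors="replace") as handle:
            # deque with maxlen keeps only the last tail_lines lines.
            tail = [line.rstrip("\n") for line in deque(handle, maxlen=tail_lines)]
    return {
        "files": sorted(os.listdir(tune_dir)),
        "final_config_exists": os.path.exists(final_config),
        "log_tail": tail,
    }


if __name__ == "__main__":
    report = tuning_report("tune")
    print("\n".join(report["files"]))
    print("joshua.config.final present:", report["final_config_exists"])
    print("\n".join(report["log_tail"]))
```

If the final config is missing after a tuning step that "took 27 seconds", the log tail is the first place to look for the reason the tuner exited early.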

> On Oct 27, 2016, at 12:49 AM, John Hewitt wrote:
> 
> It seems like MERT isn't writing its final config file (which is typical
> of MERT, in my experience). I recall giving up and using kbmira. This final
> config file is the one used in test, so I can see why skipping to test ends
> up failing pretty quick.
> 
> To answer your question, though, I haven't tried. Not in my bandwidth right
> now.
> 
> -John
> 
> On Thu, Oct 27, 2016 at 12:44 AM, lewis john mcgibbney 
> wrote:
> 
>> Hi Folks,
>> So I've been plodding away again and feel I am very close to generating my
>> first language pack, however I've arrived at the following fankle!!!
>> If I run a pipeline from start to finish it fails at the 'test-bundle-1'
>> phase as below stating " [Errno 2] No such file or directory:
>> '/usr/local/joshua_resources/russian_experiments/exp3/tune/
>> joshua.config.final'"
>> 
>> lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments/exp3 $
>> /usr/local/incubator-joshua/bin/pipeline.pl --rundir . --type hiero
>> --corpus /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en
>> --tune /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune
>> --test /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test
>> --source en --target ru --readme "Experiment 3 Run 1 of ru --> en model
>> training" --aligner berkeley --hadoop-mem 10g --tmp
>> /usr/local/hadoop-2.5.2/hadoop_tmp_dir
>> [train-copy-and-filter] cached, skipping...
>> [train-tokenize-en] cached, skipping...
>> [train-tokenize-ru] cached, skipping...
>> [train-trim] cached, skipping...
>> [train-lowercase-en] cached, skipping...
>> [train-lowercase-ru] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [train-vocab-ru] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-tokenize-en] cached, skipping...
>> [tune-tokenize-ru] cached, skipping...
>> [tune-lowercase-en] cached, skipping...
>> [tune-lowercase-ru] cached, skipping...
>> [tune-vocab-en] cached, skipping...
>> [tune-vocab-ru] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-tokenize-en] cached, skipping...
>> [test-tokenize-ru] cached, skipping...
>> [test-lowercase-en] cached, skipping...
>> [test-lowercase-ru] cached, skipping...
>> [test-vocab-en] cached, skipping...
>> [test-vocab-ru] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>   dep=/usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.en
>>   dep=/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config [CHANGED]
>>   dep=tune/model/grammar.gz.packed/slice_0.source
>>   dep=/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final [NOT FOUND]
>>   cmd=/usr/local/incubator-joshua/scripts/training/run_tuner.py
>>     /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.en
>>     /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.ru
>>     --tunedir /usr/local/joshua_resources/russian_experiments/exp3/tune
>>     --tuner mert
>>     --decoder /usr/local/joshua_resources/russian_experiments/exp3/tune/decoder_command
>>     --decoder-config /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config
>>     --decoder-output-file /usr/local/joshua_resources/russian_experiments/exp3/tune/output.nbest
>>     --decoder-log-file /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.log
>>     --iterations 10 --metric 'BLEU 4 closest'
>>   took 27 seconds (27s)
>> [test-bundle-1] rebuilding...
>>   dep=/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final [NOT FOUND]
>>   dep=grammar.gz
>>   dep=/usr/local/joshua_resources/russian_experiments/exp3/test/1/model/joshua.config
>>   cmd=/usr/local/incubator-joshua/scripts/support/run_bundler.py --force
>>     --symlink --absolute --verbose -T /usr/local/hadoop-2.5.2/hadoop_tmp_dir
>>     /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final
>>     /usr/local/joshua_resources/russian_experiments/exp3/test/1/model
>>     --copy-config-options '-top-n 300 -pop-limit 5000 -output-format "%i ||| %s ||| %f ||| %c" -mark-oovs false'
>>     --pack-tm grammar.gz
>>     --tm /usr/local/joshua_resources/russian_experiments/exp3/data/tune/grammar.glue
>>   JOB FAILED (return code 2)
>> ERROR:root:ERROR: argument config: can't open
>> '/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final':
>> [Errno 2] No such file or directory:
>> '/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final'
>> 
>> However, if I run the pipeline with the 

Re: Lewis Volunteering for 6.1 Release Manager

2016-11-10 Thread Matt Post
Just landing back in the states from Berlin. This sounds great Lewis!

matt (from my phone)

> On Nov 10, 2016, at 12:02 PM, lewis john mcgibbney wrote:
> 
> Hi Folks,
> I would like to put myself forward as release manager for 6.1.
> I've got a lot of experience working with Incubating releases and have been
> successful in the position of release manager resulting in the release of
> around 20-30 official incubating and top level projects here at Apache.
> I'll make sure to document the entire release procedure on our wiki for
> future reference.
> Does anyone object? If not then I will get to it today.
> Lewis
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: Joshua 6.1

2016-10-19 Thread Matt Post
- I'm happy to release the language packs with an Apache 2.0 license.

- It looks like there's quite a bit of paperwork involved with the release. Is 
anyone available to help out with this or even head it up?

- We ran into a hitch building language packs, but have resumed and most of 
them are almost done. We should have over 60.

- Meanwhile the world is changing and the neural approach is becoming more and 
more obviously the right thing to do. I have some ideas on how this fits in to 
Joshua which I'll send out in another email.

matt



> On Oct 16, 2016, at 4:40 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
> 
> Hi Matt,
> I like the sound of this :)
> 
> On Fri, Oct 14, 2016 at 9:25 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
> 
>> 
>> From: Matt Post <p...@cs.jhu.edu>
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Date: Thu, 13 Oct 2016 12:58:47 -0400
>> Subject: Joshua 6.1
>> Hi folks,
>> 
>> I think I'm going to do the 6.1 release tomorrow. Any objections?
>> 
> 
> No none at all!
> 
> 
>> 
>> Along with the release will be about 60 language packs for a large range
>> of languages. These will be released early next week and will be built on
>> BerkeleyLM, so that there are no external dependencies.
>> 
> 
> Sounds grand. As stated, it would be really cool if these could also be
> ALv2.0 licensed.
> 
> 
>> 
>> I'd like to push out the release quietly until the language packs are
>> ready, uploaded, and linked.
>> 
> 
> Cool.
> 
> 
>> 
>> Is there anything I need to know to do an Apache release?
>> 
>> 
> Yes a few things. You can see the incubator release checklist at
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> There is also some more general documentation available at
> http://incubator.apache.org/guides/graduation.html#releases, which will
> eventually lead you to the release check list anyways.
> If you have any issues then let's hash them out on this thread. Please note
> that we need to review and VOTE prior to anything being pushed. We then
> need to go to the Incubator PMC to get wider approval before shipping the
> release. This 'can' be a bit painful... however from experience, if we 1)
> document the release management procedure on our wiki, and 2) iron out any
> issues within dev@joshua before we go to general@incubator then I am sure
> we will not encounter too many issues.
> Lewis



Re: Joshua 6.1

2016-10-14 Thread Matt Post
I don't see why not?


> On Oct 14, 2016, at 3:36 AM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> Hi Matt,
> 
> thanks for pushing this forward, +1 from me.
> One concern I have is related to the language pack licensing: can we
> distribute them under the AL2 license? (as "convenience" binaries, as the
> official release consists of the Joshua source code).
> I'm asking this because in OpenNLP we have had a long-standing issue with
> model licensing.
> 
> Regards,
> Tommaso
> 
> 
> 
> On Thu, Oct 13, 2016 at 6:58 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Hi folks,
>> 
>> I think I'm going to do the 6.1 release tomorrow. Any objections?
>> 
>> Along with the release will be about 60 language packs for a large range
>> of languages. These will be released early next week and will be built on
>> BerkeleyLM, so that there are no external dependencies.
>> 
>> I'd like to push out the release quietly until the language packs are
>> ready, uploaded, and linked.
>> 
>> Is there anything I need to know to do an Apache release?
>> 
>> matt
>> 
>> 
>> 



Re: Joshua Model Input Format(s) and LM Loading

2016-10-25 Thread Matt Post
Hi Lewis,

Joshua supports two language model representation packages: KenLM [0] and 
BerkeleyLM [1]. These were both developed at about the same time, and 
represented huge gains in doing this task efficiently, over what had previously 
been the standard approach (SRILM). Ken Heafield (who has contributed a lot to 
Joshua) went on to contribute a lot of other improvements to language model 
representation, decoder integration, and also the actual construction of 
language models and their efficient interpolation. His goal for a while was to 
make SRILM completely unnecessary, and I think he succeeded.

BerkeleyLM was more of a one-off project. It is slower than KenLM and hasn't 
been touched in years. If you want to understand how LM representation works, 
your efforts are probably best spent looking into the KenLM papers. But it's 
also worth noting that Ken is a 
crack C++ programmer who has spent years hacking away on these problems, and 
your chances of finding any further efficiencies there are probably quite 
limited unless you have a lot of background in the area. But even if you did, I 
would recommend you not spend your time that way — I basically consider the LM 
representation problem to have been solved by KenLM. That's not to say that 
there are no improvements to be had on the Joshua / JNI bridge, but even 
there, there are probably better things to do.

matt

[0] KenLM: Faster and Smaller Language Model Queries
http://www.kheafield.com/professional/avenue/kenlm.pdf

[1] Faster and Smaller N-Gram Language Models
http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf




> On Oct 24, 2016, at 10:21 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> I have set out with the aim of learning more about the underlying Joshua
> language model serialization(s) e.g. statistical n-gram model in ARPA
> format [0] as well as trying to JProfile a Joshua server running to better
> understand how objects are used and what runtime memory usage looks like
> for typical translation tasks.
> This has led me to think about the fundamental performance issues we
> experience when loading large LM's into memory in the first place... and
> the efficiency of searching models regardless of whether they are cached in
> memory (e.g. Joshua server), or not.
> Does anyone have detailed technical/journal documentation which would set
> me in the right direction to address the above area?
> Thanks
> Lewis
> 
> [0]
> http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats#statistical_n-gram_models_in_the_arpa_format
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
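
For readers unfamiliar with the ARPA text format mentioned in the question above: the file opens with a \data\ header listing how many n-grams of each order follow, then one section per order. A minimal sketch that reads only that header, assuming a well-formed file; the function name is illustrative, and real loaders (KenLM, BerkeleyLM) do far more than this.

```python
import re

# Header lines look like "ngram 1=12345".
NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def arpa_ngram_counts(lines):
    """Return {order: count} from the \\data\\ header of an ARPA-format LM."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        if line.startswith("\\"):
            # The first "\N-grams:" section marks the end of the header.
            break
        match = NGRAM_COUNT.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


if __name__ == "__main__":
    sample = [
        "\\data\\",
        "ngram 1=3",
        "ngram 2=2",
        "",
        "\\1-grams:",
        "-0.5\tthe\t-0.3",
    ]
    print(arpa_ngram_counts(sample))
```

The counts give a rough sense of why loading is slow: every listed n-gram has to be parsed from text unless the model has been converted to a binary representation first.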



Re: language pack #1

2016-10-25 Thread Matt Post
Hi Lewis,

I have parameters to set the default amount of memory when building the 
language pack. The comment therein is just boilerplate that didn't get 
parameterized. I'll add that to the script. In general, memory usage can be 
heuristically set to the size of the model files that are loaded (the grammar 
and the language model).

Great to hear that things are working well for you!

matt
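
The heuristic above (heap roughly equal to the size of the loaded model files) can be sketched as follows. The headroom factor, the rounding to whole gigabytes and the function name are assumptions added for illustration, not part of the language pack build script.

```python
import math
import os


def suggested_heap_gb(model_paths, headroom=1.0):
    """Suggest a JVM heap size in whole GiB from the on-disk size of the models.

    model_paths may mix files (e.g. a language model) and directories
    (e.g. a packed grammar); directories are summed recursively.
    """
    total = 0
    for path in model_paths:
        if os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                total += sum(os.path.getsize(os.path.join(root, name)) for name in files)
        else:
            total += os.path.getsize(path)
    return max(1, math.ceil(total * headroom / 2**30))
```

The result could be interpolated into the launcher script's memory setting and its comment at build time, which would keep the two from drifting apart as in the report below.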


> On Oct 24, 2016, at 11:48 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
> 
> Hi Matt,
> I got around to testing out the language pack you posted and have a few
> suggestions.
> 
>   -  The Joshua bash script states in a number of places that ..."# The
>   default amount of memory is 4gb". This is not true as it is set to a
>   different (higher) number by default.
>   - When starting the Joshua server, I monitored memory usage (JProfiler)
>   and it seems to somewhat stabilize and linger at around 5 1/2 GB. Is this
>   normal based on the size of the Berkeley LM?
>   - Translations are working pretty damn well. I've run a large amount of
>   current Spanish text relating to current news stories and the output looks
>   pretty comprehensive.
> 
> It would be great if we could update the Joshua Homebrew recipe with this
> language pack and also link to the pack from the Wiki.
> 
> Lewis
> 
> On Mon, Oct 10, 2016 at 2:48 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
> 
>> 
>> From: Matt Post <p...@cs.jhu.edu>
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Date: Fri, 7 Oct 2016 11:51:41 -0400
>> Subject: Re: language pack #1
>> That would be awesome.
>> 
>> 



Re: Thrax Error in WordLexicalProbabilityCalculator - Word id 2146928632 out of range 0 1727042

2016-10-21 Thread Matt Post
This is strange. I haven't looked into this again but don't have any insights. 
Thanks for the followup.


> On Oct 21, 2016, at 3:35 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> Follow up.
> It seems that when I clean the .cachepipe as well as all of the existing
> alignments, etc from the previous run and re-run the entire pipeline then
> this issue disappears.
> I have no real explanation for why this happened. All I can say is that it
> is of course best to run experiments in different directories when you make
> a tweak to a pipeline.
> Lewis
> 
> On Thu, Oct 20, 2016 at 12:20 AM, lewis john mcgibbney 
> wrote:
> 
>> Hi dev@,
>> 
>> Sitting facing some issues with Thrax using Joshua master branch.
>> I invoke Joshua as follows
>> 
>> /usr/local/incubator-joshua/bin/pipeline.pl  --rundir . --type hiero
>> --corpus 
>> /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en
>> --tune 
>> /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune
>> --test 
>> /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test
>> --source en --target ru --readme "Experiment 1 Run 1 of ru --> en model
>> training" --aligner berkeley --tmp /usr/local/hadoop-2.5.2/hadoop_tmp_dir
>> --first-step thrax --no-prepare --alignment alignments/training.align
>> --hadoop-mem 10g
>> 
>> I make the first step thrax as I have previously computed my alignment as
>> indicated by the arguments.
>> My Thrax log is available at
>> https://www.dropbox.com/s/pxld70ki656fn13/thrax.log?dl=0. In the log you
>> will see an exception as follows
>> 
>> 16/10/19 22:56:59 WARN mapred.LocalJobRunner: job_local1314413872_0002
>> java.lang.Exception: java.lang.RuntimeException: Word id 2146928632 out of range 0 1727042
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
>> Caused by: java.lang.RuntimeException: Word id 2146928632 out of range 0 1727042
>>     at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Partition.getPartition(WordLexicalProbabilityCalculator.java:133)
>>     at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Partition.getPartition(WordLexicalProbabilityCalculator.java:121)
>>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
>>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>>     at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Map.map(WordLexicalProbabilityCalculator.java:82)
>>     at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Map.map(WordLexicalProbabilityCalculator.java:28)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> 
>> I see no other issues until the end of the Thrax log where I see
>> 
>> class edu.jhu.thrax.hadoop.jobs.TargetWordGivenSourceWordProbabilityJob FAILED
>> class edu.jhu.thrax.hadoop.jobs.OutputJob PREREQ_FAILED
>> class edu.jhu.thrax.hadoop.features.annotation.AnnotationFeatureJob PREREQ_FAILED
>> class edu.jhu.thrax.hadoop.features.mapred.TargetPhraseGivenSourceFeature SUCCESS
>> class edu.jhu.thrax.hadoop.jobs.ExtractionJob SUCCESS
>> class edu.jhu.thrax.hadoop.features.mapred.SourcePhraseGivenTargetFeature SUCCESS
>> class edu.jhu.thrax.hadoop.jobs.VocabularyJob SUCCESS
>> class edu.jhu.thrax.hadoop.jobs.SourceWordGivenTargetWordProbabilityJob FAILED
>> 
>> This issue has previously been reported by Matt over on
>> https://github.com/joshua-decoder/thrax/issues/10
>> 
>> Debugging right now folks.
>> Lewis
>> 
>> --
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
>> 
> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-14 Thread Matt Post
+1

Thanks for starting this off, Lewis!


> On Nov 14, 2016, at 12:54 PM, Ramirez, Paul M (398M) 
>  wrote:
> 
> +1, let's get it released!!!
> 
> --Paul
> 
> ==
> Paul Ramirez - Group Supervisor
> Computer Science for Data Intensive Applications (398M)
> NASA - Jet Propulsion Laboratory
> 4800 Oak Grove Dr.
> Pasadena, CA 91109 USA
> Mailstop: 158-242
> Office: 818-354-1015
> Cell: 818-395-8194
> ==
> 
> On 11/14/16, 9:16 AM, "lewis john mcgibbney"  wrote:
> 
>Hi Folks,
>Please VOTE on the Apache Joshua 6.1 Release Candidate #1.
> 
>We solved 44 issues: https://s.apache.org/joshua6.1
> 
>Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
>https://s.apache.org/joshua6.1tag
> 
>Staging repo:
>https://repository.apache.org/content/repositories/orgapachejoshua-1000/
> 
>Source Release Artifacts:
>https://dist.apache.org/repos/dist/dev/incubator/joshua/
> 
>PGP release keys (signed using 48BAEBF6):
>https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
> 
>Vote will be open for 72 hours.
>Thank you to everyone that is able to VOTE as well as everyone that
>contributed to Apache Joshua 6.1.
> 
>[ ] +1, let's get it released!!!
>[ ] +/-0, fine, but consider to fix few issues before...
>[ ] -1, nope, because... (and please explain why)
> 
>P.S. here is my +1
> 
>-- 
>http://home.apache.org/~lewismc/
>@hectorMcSpector
>http://www.linkedin.com/in/lmcgibbney
> 
> 



Re: language packs blog post

2016-11-21 Thread Matt Post
That's better, fixed.


> On Nov 21, 2016, at 3:14 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Looks good to me, no objection to tweeting it.  Nice work putting them all
> together.
> 
> On Mon, Nov 21, 2016 at 9:00 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Hi folks,
>> 
>> I just drafted this; any objections to tweeting it?
>> 
>> https://cwiki.apache.org/confluence/display/JOSHUA/2016/11/21/Apache+Joshua+Language+Packs
>> 
>> matt



language packs blog post

2016-11-21 Thread Matt Post
Hi folks,

I just drafted this; any objections to tweeting it?


https://cwiki.apache.org/confluence/display/JOSHUA/2016/11/21/Apache+Joshua+Language+Packs

matt

Re: Dockerhub hosted images

2016-11-23 Thread Matt Post
Kellen, can I bother you to post a few first steps? I've successfully pulled 
this down to my Mac but now do not know how to find it, edit it, or run it. I'm 
poring over the documentation and will find it eventually, but this would 
save me a bit of time.


> On Nov 23, 2016, at 8:07 AM, kellen sunderland  
> wrote:
> 
> Yes my next step was going to be getting it hosted officially.
> 
> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
> Apache account until I've done a little more testing though.
> 
> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney"  wrote:
> 
>> Hi Kellen,
>> Nice :)
>> Another option is for us to host these via the Apache account.
>> https://hub.docker.com/r/apache/
>> We could then add a badge to our README which points to the Dockerfile(s).
>> Do you want to open a ticket over on the INFRA Jira for this?
>> 
>> On Tue, Nov 22, 2016 at 1:57 PM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> From: kellen sunderland 
>>> To: "dev@joshua.incubator.apache.org" 
>>> Cc:
>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>> Subject: Re: Dockerhub hosted images
>>> Ok, the first image should be properly uploaded now.
>>> 
>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>> 
>>> -Kellen
>>> 
>>> 
>> 



Re: [VOTE] Release Apache Joshua 6.1 RC#2

2016-11-23 Thread Matt Post
+1 Thanks, Lewis!


> On Nov 23, 2016, at 12:15 AM, lewis john mcgibbney  wrote:
> 
> Hello user@ and dev,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #2.
> 
> We solved 50 issues: https://s.apache.org/joshua6.1
> 
> Git source tag (29c8be650d53216f779a340d33f8f61af4d45629):
> https://s.apache.org/pk2t 
> 
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1001/
> 
> 
> Source Release Artifacts:
> https://dist.apache.org/repos/dist/dev/incubator/joshua/
> 
> PGP release keys (signed using 48BAEBF6):
> https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
> 
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
> 
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
> 
> P.S. here is my +1
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



test non apache account

2016-11-23 Thread Matt Post


matt (from my phone)


Re: Dockerhub hosted images

2016-11-23 Thread Matt Post
Okay, I have this running with

docker run -it kellens/apache-joshua-es-en-2016-10-05 bash

It seems we are missing Perl (./prepare.sh fails), and we should replace the 
LanguageModel line with a KenLM instance and build that. I bet we'll need 
Python, too.




> On Nov 23, 2016, at 8:15 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Kellen, can I bother you to post a few first steps? I've successfully pulled 
> this down to my Mac but now do not know how to find it, edit it, or run it. 
> I'm poring over the documentation and will find it eventually, but this 
> would save me a bit of time.
> 
> 
>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
>> wrote:
>> 
>> Yes my next step was going to be getting it hosted officially.
>> 
>> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
>> Apache account until I've done a little more testing though.
>> 
>> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" <lewi...@apache.org> wrote:
>> 
>>> Hi Kellen,
>>> Nice :)
>>> Another option is for us to host these via the Apache account.
>>> https://hub.docker.com/r/apache/
>>> We could then add a badge to our README which points to the Dockerfile(s).
>>> Do you want to open a ticket over on the INFRA Jira for this?
>>> 
>>> On Tue, Nov 22, 2016 at 1:57 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> From: kellen sunderland <kellen.sunderl...@gmail.com>
>>>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
>>>> Cc:
>>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>>> Subject: Re: Dockerhub hosted images
>>>> Ok, the first image should be properly uploaded now.
>>>> 
>>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>>> 
>>>> -Kellen
>>>> 
>>>> 
>>> 
> 



Re: Any symal experts?

2016-11-23 Thread Matt Post
I think it will be much less of a headache. The GIZA++ code is notorious for 
being unreadable, and the Perl piece of that pipeline only hurts (even though 
Philipp's Perl is unusually clear). I think adding atools to your port is the 
way to go, and that it's written in C++ should facilitate that.




> On Nov 23, 2016, at 12:25 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
> 
> I'll keep the symal use on the backburner and start putting together an
> atools port.
> 
> -John
> 
> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>> indeed replaced them with "atools"; how much work would it be to port that?
>> 
>> 
>>> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu>
>> wrote:
>>> 
>>> Hey everyone,
>>> 
>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it
>>> into the pipeline.
>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>> that I haven't ported to Java.
>>> We package symal (which symmetrizes alignments) with Joshua right now
>>> for GIZA++, so I'm attempting to re-use that.
>>> However, symal uses the .bal format, which it fails to describe.
>>> It gets away with this because files from GIZA++ are piped through
>>> giza2bal.pl, which itself is not well documented.
>>> I'm attempting to write, say, fastalign2bal.py.
>>> With a bit of tinkering, I got at the .bal format:
>>> 
>>> 1
>>> 
>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>> 
>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>> 
>>> A template for which would be
>>> 
>>> 1
>>> 
>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> 
>>> 
>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>> some fastalign2bal.py output.
>>> A few hours with gdb yielded some progress (as far as I can tell, the
>>> formats are identical), but if anyone has experience with symal, I would
>>> greatly appreciate some consultation.
>>> 
>>> -John
>> 
>> 



Re: Any symal experts?

2016-11-23 Thread Matt Post
John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed 
replaced them with "atools"; how much work would it be to port that?


> On Nov 23, 2016, at 12:11 PM, John Hewitt  wrote:
> 
> Hey everyone,
> 
> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> the pipeline.
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetrizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
> However, symal uses the .bal format, which it fails to describe.
> It gets away with this because files from GIZA++ are piped through
> giza2bal.pl, which itself is not well documented.
> I'm attempting to write, say, fastalign2bal.py.
> With a bit of tinkering, I got at the .bal format:
> 
> 1
> 
> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> 
> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> 
> A template for which would be
> 
> 1
> 
> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
> alignment2 ... alignmentN]
> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
> alignment2 ... alignmentN]
> 
> 
> However, I'm hitting some pretty nasty errors with symal when I pipe in
> some fastalign2bal.py output.
> A few hours with gdb yielded some progress (as far as I can tell, the
> formats are identical), but if anyone has experience with symal, I would
> greatly appreciate some consultation.
> 
> -John
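
For reference, the .bal layout John describes above can be sketched in a few
lines of Python. This is only a sketch built from the single example in the
thread, not the real giza2bal.pl logic: the function names are hypothetical,
writing unaligned tokens as 0 is an assumption (the example shows none), and
which alignment direction belongs on which line still needs checking against
symal. The only thing taken from fast_align itself is that it prints links as
space-separated "i-j" pairs of 0-based positions.

```python
def parse_pairs(line, key=0):
    """Parse fast_align-style 'i-j' pairs into a dict of 0-based links.

    key=0 indexes the dict by the left number of each pair, key=1 by the
    right number. If a position occurs more than once, the last pair wins
    (an arbitrary choice made for this sketch).
    """
    links = {}
    for pair in line.split():
        idx = [int(x) for x in pair.split("-")]
        links[idx[key]] = idx[1 - key]
    return links


def bal_side(tokens, links):
    """Format one side of a record: token count, tokens, '#', 1-based links.

    Tokens with no link are written as 0 (assumption, see lead-in).
    """
    aligned = [str(links[i] + 1) if i in links else "0"
               for i in range(len(tokens))]
    return "{} {}  # {}".format(len(tokens), " ".join(tokens), " ".join(aligned))


def bal_record(sent_id, tgt_tokens, tgt_links, src_tokens, src_links):
    """Build the three-line record: sentence number, target side, source side."""
    return "\n".join([
        str(sent_id),
        bal_side(tgt_tokens, tgt_links),
        bal_side(src_tokens, src_links),
    ])


if __name__ == "__main__":
    # Reproduces the target-side line of the example from the thread.
    tgt = "jehovah said to moses and aaron :".split()
    print(bal_side(tgt, parse_pairs("0-2 1-1 2-1 3-3 4-4 5-5 6-7")))
```

Diffing output like this against what giza2bal.pl emits for the same sentence
pair may be a quicker way to spot a format mismatch than stepping through
symal in gdb.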



Re: Downloading of non ASF licensed code

2016-11-28 Thread Matt Post
This would be easy to do. Maybe just a simple prompt that alerts the user? 
Something like

echo "Warning: this script downloads many tools used in building and running"
echo "Joshua. Not all of them are Apache Licensed. If you wish to continue, hit Enter."
read -r j
# Anything other than a bare Enter aborts the download.
if [[ -n "$j" ]]; then
    echo "Quitting."
    exit 1
fi



> On Nov 25, 2016, at 10:41 AM, Tom Barber  wrote:
> 
> This may have come up before in the whole licensing chat so apologies if
> I'm just going over old ground.
> 
> The download-deps.sh file obviously downloads and builds stuff with non-ASF
> licenses. I realise this is for model training purposes only, and 99.9%
> won't care, but should we consider putting a prompt into that script warning
> people. I ask because a company might add in the training modules blindly
> assuming because the script is distributed by the ASF the modules are also
> ASL2.0.
> 
> Just a thought.
> 
> Tom
> 
> -- 
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689



★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post
One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados. 
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system, than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g., formal news, informal 
web logs, spoken dialogue, etc). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.
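
To make the translation-memory idea above concrete, here is a minimal sketch
of the fuzzy lookup step in Python. It is an illustration only and not part
of Joshua: the function name, the 0.7 threshold, and the use of difflib's
token-level similarity ratio are all assumptions standing in for whatever
matching a real TM tool does.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(memory, query, threshold=0.7):
    """Return (score, source, translation) for the best fuzzy match, or None.

    `memory` is a list of (source, translation) pairs from past work.
    Similarity is the SequenceMatcher ratio over whitespace tokens, a
    stand-in for a real TM matching measure.
    """
    best = None
    query_tokens = query.split()
    for source, translation in memory:
        score = SequenceMatcher(None, query_tokens, source.split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, translation)
    return best


if __name__ == "__main__":
    memory = [
        ("the cat sat on the mat", "el gato se sento en la alfombra"),
        ("good morning", "buenos dias"),
    ]
    print(fuzzy_lookup(memory, "the cat sat on a mat"))
```

A sentence with no match above the threshold returns None, which is the
point where a system seeded with a regular statistical model would fall back
to decoding.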

matt

Re: Dockerhub hosted images

2016-11-22 Thread Matt Post
How do I clone this? Docker tells me there is no tag "latest", using "-a" tells 
me the repo is not found, and I can't seem to figure out how to tell Docker to 
use hub.docker.com...


> Here's a link to the first image I've been playing with, es-en.
> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/




Re: "mvn assembly" no longer works

2016-11-17 Thread Matt Post
Ah, thanks Lewis. I did update the README to mention the new package target.



> On Nov 17, 2016, at 1:36 AM, lewis john mcgibbney  wrote:
> 
> Hi Matt,
> Again, I am on digest and didn't receive but I'll reply here.
> No need to use the Maven assembly plugin anymore... simply execute mvn 
> package... you will then see 
> ./target/joshua-6.2-SNAPSHOT-jar-with-dependencies.jar the exact same, but 
> now a default Maven task rather than a custom plugin implementation.
> Do we need to update README?
> 
> -- 
> http://home.apache.org/~lewismc/ 
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney 



Re: Updating Incubator summary

2016-11-17 Thread Matt Post
My thinking on that roadmap was a comment Lewis made a while ago about 
incubator graduation being judged by the number of releases. If you think we 
can get out sooner, then I'm all for it! Maybe we can get the docker containers 
out and then push for it after that?

I like your idea about a more concerted advertising effort. We could also try 
to pull together a demo paper for ACL <http://acl2017.org/>  which is due in 
February. I think I might have a hook that would appeal to reviewers there.


> On Nov 17, 2016, at 2:12 AM, Henri Yandell <bay...@apache.org> wrote:
> 
> Sounds good :)
> 
> My basic mantra is 'get the summary page all signed off, then start asking
> "when graduate?"'. Projects can tend to linger in the Incubator awaiting
> perfection.
> 
> I wonder how you could take the 3rd item (Linux.com article) and make that
> bigger. Perhaps encourage every committer to write a blog post so you end
> up with the article as an intro, and then each committer's blog entry or
> website hosted article as a personal "how I got into this" or "what I work
> on" or "a commit I recently did, a commit I keep meaning to get around
> to working on". Random thought :)
> 
> Hen
> 
> On Tue, Nov 15, 2016 at 11:09 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> We're still waiting on our first software release, so it seems to me a bit
>> premature to graduate? Though I don't know how these decisions are made —
>> what goes into it?
>> 
>> Here is the roadmap that I have in mind:
>> 
>> - 6.1 release (imminent)
>> - Large-scale release of language packs (imminent)
>> - Linux.com article introducing people to MT, Joshua, language packs, and
>> adding custom rules
>> - Release of docker-based language packs (including KenLM)
>> - 7.0 release (spring)
>> - Graduate
>> 
>> If we keep that rough schedule, we'll have incubated a year and have a lot
>> to show for it.
>> 
>> matt
>> 
>> 
>>> On Nov 15, 2016, at 12:13 PM, Henri Yandell <bay...@apache.org> wrote:
>>> 
>>> Thanks :)
>>> 
>>> Reason for asking being that it felt that the standard checklist things
>>> were complete and I was wondering what the path to graduation is?
>>> 
>>> Any reason not to start thinking about a vote?
>>> 
>>> On Tue, Nov 15, 2016 at 04:02 Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Thanks, Lewis, and Henri, for pointing this out.
>>>> 
>>>> 
>>>>> On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <lewi...@apache.org>
>>>> wrote:
>>>>> 
>>>>> Hi Henri,
>>>>> I just pushed the update to SVN. Should update asynch reasonably soon.
>>>>> 
>>>>> http://incubator.apache.org/projects/joshua.html
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Sun, Nov 13, 2016 at 1:22 PM, <
>>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>>> 
>>>>>> 
>>>>>> From: Henri Yandell <bay...@apache.org>
>>>>>> To: dev@joshua.incubator.apache.org
>>>>>> Cc:
>>>>>> Date: Sun, 13 Nov 2016 01:17:57 -0800
>>>>>> Subject: Updating Incubator summary
>>>>>> Would be useful to update this page:
>>>>>> 
>>>>>> http://incubator.apache.org/projects/joshua.html
>>>>>> 
>>>>>> 
>>>>>> Are there any of the checklist items that are still open?
>>>>>> 
>>>>>> 
>>>>> As far as I am aware no :)
>>>> 
>>>> 
>> 
>> 



package-info.java

2016-11-16 Thread Matt Post
Hi Thamme,

Eclipse is complaining about package-info.java files, e.g.,

The type package-info is already defined

for org.apache.joshua.decoder.package-info.java. I see that a while ago you 
replaced the package-info.html files with these. Is there a particular reason 
for this? Is .java preferred to .html? In researching solutions to this, one of 
the suggestions was to go to .html.

matt

Re: Updating Incubator summary

2016-11-15 Thread Matt Post
We're still waiting on our first software release, so it seems to me a bit 
premature to graduate? Though I don't know how these decisions are made — what 
goes into it?

Here is the roadmap that I have in mind:

- 6.1 release (imminent)
- Large-scale release of language packs (imminent)
- Linux.com article introducing people to MT, Joshua, language packs, and 
adding custom rules
- Release of docker-based language packs (including KenLM)
- 7.0 release (spring)
- Graduate

If we keep that rough schedule, we'll have incubated a year and have a lot to 
show for it.

matt


> On Nov 15, 2016, at 12:13 PM, Henri Yandell <bay...@apache.org> wrote:
> 
> Thanks :)
> 
> Reason for asking being that it felt that the standard checklist things
> were complete and I was wondering what the path to graduation is?
> 
> Any reason not to start thinking about a vote?
> 
> On Tue, Nov 15, 2016 at 04:02 Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Thanks, Lewis, and Henri, for pointing this out.
>> 
>> 
>>> On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <lewi...@apache.org>
>> wrote:
>>> 
>>> Hi Henri,
>>> I just pushed the update to SVN. Should update asynch reasonably soon.
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>>> 
>>> Thanks
>>> 
>>> On Sun, Nov 13, 2016 at 1:22 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> 
>>>> From: Henri Yandell <bay...@apache.org>
>>>> To: dev@joshua.incubator.apache.org
>>>> Cc:
>>>> Date: Sun, 13 Nov 2016 01:17:57 -0800
>>>> Subject: Updating Incubator summary
>>>> Would be useful to update this page:
>>>> 
>>>> http://incubator.apache.org/projects/joshua.html
>>>> 
>>>> 
>>>> Are there any of the checklist items that are still open?
>>>> 
>>>> 
>>> As far as I am aware no :)
>> 
>> 


