[jira] [Created] (JOSHUA-283) Implement fast_align as one of the available alignment options

2016-07-20 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created JOSHUA-283:
---

 Summary: Implement fast_align as one of the available alignment options
 Key: JOSHUA-283
 URL: https://issues.apache.org/jira/browse/JOSHUA-283
 Project: Joshua
  Issue Type: Bug
  Components: alignment, pipeline
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 6.1


For some time now, I've been having issues using GIZA++ for alignment whilst 
running a Joshua pipeline.
Whilst I was looking for an alternative, [~post] and [~kellen.sunderland] 
mentioned the Berkeley Aligner and fast_align respectively.
Because 1) the Berkeley Aligner has not been touched in ~9 years, and 
2) no artifact currently exists on Maven Central, I am taking the advice and 
attempting to use fast_align.
This issue will augment the alignment code in Joshua to permit use of 
fast_align, which is ALv2.0 licensed.

https://github.com/clab/fast_align 
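
For reference, fast_align reads one sentence pair per line, with the source and
target sides separated by " ||| ", and writes alignments as "i-j" index pairs.
Below is a minimal Python sketch of preparing that input and parsing the
output. The helper names are hypothetical and are not part of Joshua or
fast_align, and skipping pairs with an empty side is an assumption of this
sketch.

```python
def to_fast_align_lines(source_sentences, target_sentences):
    """Pair tokenized sentences as 'source ||| target' lines.

    Assumption of this sketch: pairs with an empty side are skipped,
    so the caller must drop the same pairs from any parallel metadata.
    """
    lines = []
    for src, tgt in zip(source_sentences, target_sentences):
        # Collapse runs of whitespace so tokens are separated by single spaces.
        src, tgt = " ".join(src.split()), " ".join(tgt.split())
        if src and tgt:
            lines.append(f"{src} ||| {tgt}")
    return lines


def parse_alignment(line):
    """Parse one output line such as '0-0 1-2' into (source, target) index pairs."""
    pairs = []
    for token in line.split():
        i, j = token.split("-")
        pairs.append((int(i), int(j)))
    return pairs
```

Per the fast_align README, a file of such lines would typically be aligned
with something like "fast_align -i corpus.src-tgt -d -o -v > forward.align";
the file names here are placeholders.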





Re: Issue Building LM on master branch

2016-07-20 Thread Matt Post
I believe I cloned it there a while back because of the Google code expiration. 
Until recently the jar was just checked into Joshua. I'm not sure what the 
situation is now.

Probably we should put the Berkeley Aligner in the Maven codebase? Or just 
switch to fast_align (which is Apache licensed).

matt


> On Jul 20, 2016, at 11:07 AM, Lewis John Mcgibbney 
>  wrote:
> 
> @Matt,
> I noticed that you have a clone of some code representing berkeleyaligner in
> your GitHub repos
> https://github.com/mjpost/berkeleyaligner
> Is this the most up-to-date code?
> Also, I see that berkeleyaligner.jar is called at the following line
> https://github.com/apache/incubator-joshua/blob/master/scripts/training/paralign.pl#L81
> 
> On Wed, Jul 20, 2016 at 7:54 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
> 
>> OK so as it turns out passing '--aligner berkeley' to the pipeline.pl
>> invocation does not currently work in master branch.
>> My log simply prints
>> 
>> Error: Unable to access jarfile
>> /usr/local/incubator-joshua/lib/berkeleyaligner.jar
>> 
>> I'll get this sorted out and submit a PR to try and fix.
>> Thanks
>> 
>> On Wed, Jul 20, 2016 at 7:07 AM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>> 
>>> Hi Kellen and Matt,
>>> 
>>> On Tue, Jul 19, 2016 at 8:20 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> From: Matt Post 
>>>> To: dev@joshua.incubator.apache.org
>>>> Cc:
>>>> Date: Sun, 17 Jul 2016 23:30:33 -0400
>>>> Subject: Re: Issue Building LM on master branch
>>>> Lewis — This is a good-sized dataset, and on a single desktop machine, I
>>>> expect it would take at least a day to go all the way through alignment,
>>>> model-building, and tuning.
 
>>> 
>>> OK thanks for the estimate.
>>> 
>>> 
 
>>>> fast_align is a good idea, though it isn't integrated into the pipeline
>>>> (shouldn't be too hard, and is on the list). You could also just try
>>>> "--aligner berkeley" and see if that works.
 
>>> 
>>> I'll do exactly that. Starting with berkeley first and then moving on to
>>> fast_align. I'll update here with any progress.
>>> 
>>> 
 
>>>> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
>>>> Sometimes GIZA doesn't compile correctly, and this could be an error where
>>>> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
 
 
>>> AFAICT I don't see any errors prior to the bottom dozen or so lines. I've
>>> put the log below and would greatly appreciate if you could have a look
>>> through it and provide some feedback.
>>> http://home.apache.org/~lewismc/giza.log
>>> I'll update this thread on the berkeley alignment outcome before shooting
>>> to use the fast_align.
>>> Thanks both again.
>>> Lewis
>>> 
>> 
>> 
>> 
>> --
>> *Lewis*
>> 
> 
> 
> 
> -- 
> *Lewis*



Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
@Matt,
I noticed that you have a clone of some code representing berkeleyaligner in
your GitHub repos
https://github.com/mjpost/berkeleyaligner
Is this the most up-to-date code?
Also, I see that berkeleyaligner.jar is called at the following line
https://github.com/apache/incubator-joshua/blob/master/scripts/training/paralign.pl#L81

On Wed, Jul 20, 2016 at 7:54 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> OK so as it turns out passing '--aligner berkeley' to the pipeline.pl
> invocation does not currently work in master branch.
> My log simply prints
>
> Error: Unable to access jarfile
> /usr/local/incubator-joshua/lib/berkeleyaligner.jar
>
> I'll get this sorted out and submit a PR to try and fix.
> Thanks
>
> On Wed, Jul 20, 2016 at 7:07 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Kellen and Matt,
>>
>> On Tue, Jul 19, 2016 at 8:20 PM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>
>>> From: Matt Post 
>>> To: dev@joshua.incubator.apache.org
>>> Cc:
>>> Date: Sun, 17 Jul 2016 23:30:33 -0400
>>> Subject: Re: Issue Building LM on master branch
>>> Lewis — This is a good-sized dataset, and on a single desktop machine, I
>>> expect it would take at least a day to go all the way through alignment,
>>> model-building, and tuning.
>>>
>>
>> OK thanks for the estimate.
>>
>>
>>>
>>> fast_align is a good idea, though it isn't integrated into the pipeline
>>> (shouldn't be too hard, and is on the list). You could also just try
>>> "--aligner berkeley" and see if that works.
>>>
>>
>> I'll do exactly that. Starting with berkeley first and then moving on to
>> fast_align. I'll update here with any progress.
>>
>>
>>>
>>> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
>>> Sometimes GIZA doesn't compile correctly, and this could be an error where
>>> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
>>>
>>>
>> AFAICT I don't see any errors prior to the bottom dozen or so lines. I've
>> put the log below and would greatly appreciate if you could have a look
>> through it and provide some feedback.
>> http://home.apache.org/~lewismc/giza.log
>> I'll update this thread on the berkeley alignment outcome before shooting
>> to use the fast_align.
>> Thanks both again.
>> Lewis
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*


Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
OK so as it turns out passing '--aligner berkeley' to the pipeline.pl
invocation does not currently work in master branch.
My log simply prints

Error: Unable to access jarfile
/usr/local/incubator-joshua/lib/berkeleyaligner.jar

I'll get this sorted out and submit a PR to try and fix.
Thanks
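
A pre-flight check along the following lines would surface the missing jar
before the aligner is launched. This is a Python sketch for illustration only
(the pipeline scripts themselves are Perl), and the function name is
hypothetical.

```python
import os


def missing_files(paths):
    """Return the subset of paths that are not readable regular files."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]
```

For example, calling missing_files with the jar path from the log above,
/usr/local/incubator-joshua/lib/berkeleyaligner.jar, would return that path
in the list whenever the jar is absent or unreadable.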

On Wed, Jul 20, 2016 at 7:07 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Kellen and Matt,
>
> On Tue, Jul 19, 2016 at 8:20 PM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
>
>> From: Matt Post 
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Date: Sun, 17 Jul 2016 23:30:33 -0400
>> Subject: Re: Issue Building LM on master branch
>> Lewis — This is a good-sized dataset, and on a single desktop machine, I
>> expect it would take at least a day to go all the way through alignment,
>> model-building, and tuning.
>>
>
> OK thanks for the estimate.
>
>
>>
>> fast_align is a good idea, though it isn't integrated into the pipeline
>> (shouldn't be too hard, and is on the list). You could also just try
>> "--aligner berkeley" and see if that works.
>>
>
> I'll do exactly that. Starting with berkeley first and then moving on to
> fast_align. I'll update here with any progress.
>
>
>>
>> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
>> Sometimes GIZA doesn't compile correctly, and this could be an error where
>> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
>>
>>
> AFAICT I don't see any errors prior to the bottom dozen or so lines. I've
> put the log below and would greatly appreciate if you could have a look
> through it and provide some feedback.
> http://home.apache.org/~lewismc/giza.log
> I'll update this thread on the berkeley alignment outcome before shooting
> to use the fast_align.
> Thanks both again.
> Lewis
>



-- 
*Lewis*


Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
Hi Kellen and Matt,

On Tue, Jul 19, 2016 at 8:20 PM, <
dev-digest-h...@joshua.incubator.apache.org> wrote:

> From: Matt Post 
> To: dev@joshua.incubator.apache.org
> Cc:
> Date: Sun, 17 Jul 2016 23:30:33 -0400
> Subject: Re: Issue Building LM on master branch
> Lewis — This is a good-sized dataset, and on a single desktop machine, I
> expect it would take at least a day to go all the way through alignment,
> model-building, and tuning.
>

OK thanks for the estimate.


>
> fast_align is a good idea, though it isn't integrated into the pipeline
> (shouldn't be too hard, and is on the list). You could also just try
> "--aligner berkeley" and see if that works.
>

I'll do exactly that. Starting with berkeley first and then moving on to
fast_align. I'll update here with any progress.


>
> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
> Sometimes GIZA doesn't compile correctly, and this could be an error where
> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
>
>
AFAICT I don't see any errors prior to the bottom dozen or so lines. I've
put the log below and would greatly appreciate if you could have a look
through it and provide some feedback.
http://home.apache.org/~lewismc/giza.log
I'll update this thread on the berkeley alignment outcome before shooting
to use the fast_align.
Thanks both again.
Lewis
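
One way to rule out the missing-binary case Matt describes is to check that
each GIZA++ executable can actually be found. Below is a Python sketch,
assuming the binaries are expected either on PATH or in one known directory.
The function name is hypothetical; the binary names are the ones listed in
the quoted message.

```python
import shutil

# Binary names taken from the quoted message.
GIZA_BINARIES = ("GIZA++", "mkcls", "snt2cooc.out")


def find_binaries(names=GIZA_BINARIES, search_dir=None):
    """Map each binary name to its resolved path, or None if not found.

    If search_dir is given, only that directory is searched;
    otherwise the PATH environment variable is used.
    """
    return {name: shutil.which(name, path=search_dir) for name in names}
```

Any entry whose value is None points at a binary that was not built or is not
where the search expects it.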


Skip from grammar extractor

2016-07-20 Thread Arezoo Arjomand
Hi,
I am a researcher in the field of natural language processing. I have 
Joshua v5.0 installed on Ubuntu 12. I want to align the words in a parallel 
corpus and decode the test set using those word alignments. On the other hand, 
I want to skip the grammar extractor and phrase alignment in the Joshua decoder. 
I would be grateful for any help.
Sincerely,
Arezoo Arjomandzadeh

Fwd: FW: July 2016 Newsletter – LDC

2016-07-20 Thread Lewis John Mcgibbney
-- Forwarded message --
From: Mcgibbney, Lewis J (398M) 
Date: Tue, Jul 19, 2016 at 8:35 PM
Subject: FW: July 2016 Newsletter – LDC
To: Lewis John McGibbney 



Dr. Lewis John McGibbney Ph.D., B.Sc.
Data Scientist II
Computer Science for Data Intensive Applications Group 398M
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasadena, California 91109-8099
Mail Stop : 158-256C
Tel:  (+1) (818)-393-7402
Cell: (+1) (626)-487-3476
Fax:  (+1) (818)-393-1190
Email: lewis.j.mcgibb...@jpl.nasa.gov



 Dare Mighty Things

From: Ldc-customers1  on behalf of
Penn LDC 
Date: Tuesday, July 19, 2016 at 1:41 PM
To: Penn LDC 
Subject: July 2016 Newsletter – LDC


*In this Newsletter:*

*Fall 2016 Data Scholarship Program*


*2015 User Survey Results*


*New Publications:*

*English Speed Networking Conversational Transcripts*

*Digital Archive of Southern Speech - NLP Version (DASS)*

*GALE Phase 3 and 4 Chinese Broadcast News Parallel Text*

*IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c*

*Fall 2016 Data Scholarship Program*

Applications are now being accepted through *Thursday, September 15, 2016*
for the Fall 2016 LDC Data Scholarship program. The LDC Data Scholarship
program provides university students with access to LDC data at no cost.

This program is open to students pursuing both undergraduate and graduate
studies in an accredited college or university. LDC Data Scholarships are
not restricted to any particular field of study; however, students must
demonstrate a well-developed research agenda and a bona fide inability to
pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a two-page proposal
describing their intended use of the data. The proposal should state which
data the student plans to use, how the data will benefit their research
project, the proposed methodology or algorithm which will be used and how
success will be measured.

Applicants should consult the Catalog for
a complete list of data distributed by LDC. Due to certain restrictions, a
handful of LDC corpora are restricted to members of the Consortium.
Applicants are advised to select a maximum of one to two databases.

(2) Letter of Support. Applicants must submit one letter of support from
their thesis adviser or department chair. The letter must be signed and
printed on letterhead, describe the student and the research, evaluate the
probability of success and confirm that the department or university lacks
the funding to pay the full non-member fee for the data.

For further information on application materials and program rules, please
visit the LDC Data Scholarship page.




*2015 User Survey Results*

LDC conducted its fourth user survey in December 2015. This survey built on
the previous surveys conducted in 2006, 2007 and 2012 to assess user
sentiment and also asked for the evaluation of key LDC-related topics
including:

· Opinions on the new website and usability of the Catalog

· Use and satisfaction with the enhanced user services and
e-commerce system

· LDC’s Data Management Plan capabilities

· Suggestions for future publications and preferred data delivery
methods

· Use of web services for data access and processing

Overall, survey respondents were satisfied with LDC’s data, membership
options, website, Catalog and enhanced user services. Participants cited
the top five most useful corpora received between 2012 and 2015 as *OntoNotes
Release 5.0*, *TIMIT*, *TAC KBP Reference Knowledge Base*, *Penn Discourse
Treebank V 2.0*, and *Multi-Channel WSJ Audio*. Three-fourths of
respondents prefer digital delivery of data, and the top three languages for
current research demands were identified as English, Chinese and Spanish.

We thank everyone who participated in this survey. Responses will benefit
the future of the Consortium and will help LDC to better meet the needs of
our members and data licensees.





*New Publications*



(1) *English Speed Networking Conversational Transcripts* was developed at the
University of the West of England and contains 388 transcripts of English
face-to-face and instant messaging conversations about business ideas
collected in 2014 and 2015 from participants (undergraduate students)
playing different power roles.



This corpus was created to examine communication accommodation,
specifically, the ways in which an individual's linguistic style is
affected by social power and personality. The data was collected in two
studies. In the first study, 40 participants had a seri