Re: [Apertium-stuff] [PATCH] supervised weighting of automata

nlhowell Tue, 18 Jun 2019 07:13:35 -0700

OK, tried to rewrite this to be (a bit more) readable; sorry if I
over-trimmed somewhere, it was hard keeping track of what the patch
context was. Next time I'll send out in message-per-patch format. (Or
maybe Amr can do it...? `man git-send-mail`)


(I think your mail client *double spaced* the reply! Wow!)

Cheers,
Nick

Amr Mohamed Hosny Anwar wrote:
> Nick Howell <nlhow...@gmail.com> wrote:
> > 
> > Here's my (inline) review of the github PR at
> > 
> > https://github.com/apertium/lttoolbox/pull/55
> > 
> > Cheers,
> > 
> > Nick
> > 
> > --
> > > From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Sat, 11 May 2019 01:30:26 +0200
> > > Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses
> > >  generation
> > >
> > > Solves #45
> > > Consider alphanumeric characters to be part of the vocabulary.
> > > ---
> > >  lttoolbox/fst_processor.cc | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc
> > > index 2be326a..d3d029a 100644
> > > --- a/lttoolbox/fst_processor.cc
> > > +++ b/lttoolbox/fst_processor.cc
> > > @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const

> > >  bool
> > >  FSTProcessor::isAlphabetic(wchar_t const c) const
> > >  {
> > > -  return alphabetic_chars.find(c) != alphabetic_chars.end();
> > > +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) !=

> > >    alphabetic_chars.end();
> > >  }
> > >
> > >  void
> > > --
> > > 2.21.0
> > 
> > Are we sure that std::iswalnum does what we want? Should it be
> > encoding-dependent? Do we take encoding from the environment?
> >
> This commit isn't part of the PR. A separate issue was filed and can

> be found through: https://github.com/apertium/lttoolbox/issues/45 [ht-

> tps://avatars0.githubusercontent.com/u/4594256?s=400&v=4]<https://git-
> hub.com/apertium/lttoolbox/issues/45> Strange output when transducer

> is compiled from a .att file · Issue #45 ·
> apertium/lttoolbox<https://github.com/apertium/lttoolbox/issues/45>
> github.com I am trying to add weights to the morphological analyser.

> So while I was checking last year's project (http://wiki.apertium.org-

> /wiki/User:Techievena/GSoC_2018_Work_Product_Submission) I noticed...


OK

> > > From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Sun, 19 May 2019 22:39:07 +0200
> > > Subject: [PATCH 02/12] Add basic structure for the supervised weighting of
> > >  automata
> > >
> > > ---
> > > diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged
> > 
> > What does this file do? These things should be described in the
> > long-form part of the commit message.
> > 
> I deleted these files in a later commit.

OK; can be fixed with rebase then. 

> > > diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin
> > 
> > Please don't add binary blobs to git history; better to add a text

> > representation along with the necessary rules to the build system to

> > reconstruct the binary.
> > 
> - I deleted these files in a later commit.

OK, can be fixed in rebase

> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > new file mode 100755
> > > index 0000000..a8b7c3e
> > > --- /dev/null
> > > +++ b/scripts/lt-weight
> > > @@ -0,0 +1,2 @@
> > > +#! /bin/sh
> > > +
> > > --
> > > 2.21.0
> > 
> > No point in just adding an empty file; just add the basic
> > implementation in the first step
>
> I thought of using commits to reflect the progress and squashing these

> commits later on merging.

As we discussed, pull requests should be "fully formed" so that
reviewers don't spend time catching bugs that are fixed later in the
series; rebase first so that your PR consists of a sequence of
correct and well-explained patches.

> > > From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Mon, 20 May 2019 21:52:17 +0200
> > > Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram 
> > > counts
> > >
> > > ---
> > >  scripts/lt-weight | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index a8b7c3e..d1abd6b 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > > @@ -1,2 +1,7 @@
> > >  #! /bin/sh
> > 
> > add help and version output, and copyright/license info
> - Will we end-up using a shell script or we will need to implement a CC file?

That is something we can decide later. If the shell script is correct
(and portable) and not a resource hog, it is probably fine to leave
it.

> I believe we need to decide on the script's language first.

Nothing can be accepted into an upstream repository without
copyright/license information; otherwise the repository becomes
unredistributable. (Default is "all rights reserved", meaning no
copying.)

Help and version output are required so that others can understand how

to use/test your script, and know what is going on with the version.
(Maybe the version can just be pulled from the project version.)

> > 
> > 
> > > +CORPUS=$2
> > > +LINES=$(wc -l $CORPUS | cut -d ' ' -f1)
> > >
> > > +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
> > 
> > unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of

> > whitespace in the filename).
> > 
> - Ok, this cat wasn't really beneficial.
> > 
> > > +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |
> > 
> > Combine the multiple sed commands into one execution

> - Ok, let me check how to do so.  It should be easy.

Just "man sed" tells you; if you want to pass multiple commands, just
use "-e <command>" multiple times. Commands will be executed in the
order you pass them.

> > > From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Tue, 21 May 2019 00:02:04 +0200
> > > Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst

> > 
> > long-form commit message?
> > >
> > > ---
> > >  scripts/lt-weight | 30 +++++++++++++++++++++++++++++-
> > >  1 file changed, 29 insertions(+), 1 deletion(-)
> > 
> > >
> > 
> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index d1abd6b..092fc66 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > 
> > It seems to me that this version and the previous version should be

> > two different scripts: one that reads a corpus and produces weights

> > which can be read by hfst-strings2fst, and one which performs the
> > reweighting of the fst.
> - Agreed.

OK, we are discussing this on irc, for those following along at home.

> > > From b34b6769980fc2a1cb7d3659eb5c16c3894c8348 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Tue, 21 May 2019 17:59:05 +0200
> > > Subject: [PATCH 05/12] lt-weight: Update the script to handle non-trivial
> > >  corpora/transducers
> > 
> > in the long-form message, explain what problems this patch is solving?
> 
> - I am checking the git rebase command so that we can discuss how to

> update the commit messages in the best way.

OK, keep me updated on how this is going.

> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index 5420997..8cfe185 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > > @@ -3,47 +3,50 @@ FST=$1
> > >  CORPUS=$2
> > >  OUTPUT_FST=$3
> > >
> > > -# Temporary intermediate files
> > > +# Temporary directory for intermediate files
> > >  TEMP_DIR=".tmp"
> > > -mkdir $TEMP_DIR
> > > +# Check if it exists
> > > +if [-d "$TEMP_DIR"]; then
> > > +     mkdir $TEMP_DIR
> > > +fi
> > 
> > doesn't this only create the directory if it already exists? I don't

> > understand the logic
> > 
> - Solved in a later commit. Sorry for that.

Another candidate for the rebase. :)

> > >  # Delete the temporary files
> > > -rm -rf $TEMP_DIR
> > > +# rm -rf $TEMP_DIR
> > > +# ./lt-weight ../../apertium-eng/eng.automorf.bin ../../apertium-

> > > eng/texts/eng.tagged efst.bin
> > 
> > don't submit patches with commented-out code :( better to add a
> > "--debug" flag or DEBUG env variable, and lock it behind that
> > 
> - Could you elaborate more on using a debug flag?  How could it be
> used?

Well, if this is debugging info, you might add an optional argument
"--debug"; something like this:

```bash

# beginning of script
if [[ "x$1" = "x--debug" ]]; then
   shift;
   DEBUG=1
fi

# codecodecode

[[ -z "$DEBUG" ]] && rm -rf "$TEMP_DIR"

```

> > > diff --git a/scripts/prepare_regex_strings.py
> > > b/scripts/prepare_regex_strings.py
> > > new file mode 100644
> > > index 0000000..d2eac86
> > > --- /dev/null
> > > +++ b/scripts/prepare_regex_strings.py
> > > @@ -0,0 +1,41 @@
> > > +import re
> > > +import sys
> > > +import numpy as np
> > > +import pandas as pd
> > 
> > now we have dependencies on numpy and pandas; maybe it is possible to
> > 
> > get away with just vanilla python?
> > 
> - Yes, vanilla python could do the job.
> I just wanted to get the whole script up and running.
> In the end, I don't think we can use python (an additional dependency).

OK, so we will have to figure out how to do this using shell...

> > > +
> > > +#TODO: HANDLE THE REST OF THE SPECIAL CHARACTERS
> > 
> > what does "the rest" mean?
> - There may exist other special characters that need to be escaped.
> I don't know/have an exhaustive list of special characters.

OK, where should we look / who should we ask to find out the
exhaustive list?

> > needs "--help", license and copyright info; probably could mark it

> > executable. It's also important to figure out what *exact*
> > dependencies this has; does it work with python2? python3? Does
> > lttoolbox already have a python dependency? If not, better to use
> > something we already depend on.
> 
> - lttoolbox uses python only for testing
>   (https://github.com/apertium/lttoolbox/search?l=python) so I believe
>   python isn't a dependency. Let's discuss the used languages for the

>   scripts on the upcoming meeting.

Sounds good. I guess we are looking at shell-or-C(++?), then, for
anything that is targeted at downstream users. It's OK to add python
dependencies for testing, evaluation, or experimental code, but we
need to keep in mind that anything which is making it to our actual
users is going to have to be rewritten...

> > > From d92efe0a07bfb0b7a267f16ccdd26fd3e182ae19 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Tue, 4 Jun 2019 21:17:45 +0200
> > > Subject: [PATCH 08/12] Fix the repeated analyses bug

> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index 8cfe185..25f44dd 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > > @@ -6,7 +6,7 @@ OUTPUT_FST=$3
> > >  # Temporary directory for intermediate files
> > >  TEMP_DIR=".tmp"
> > >  # Check if it exists
> > > -if [-d "$TEMP_DIR"]; then
> > > +if ! [ -d "$TEMP_DIR" ]; then
> > 
> > should fix this bug in the patch it was introduced in
> > 
> - Is editing the history the better choice? My mindset was to fix bugs

>   in new commits other than missing-up with the commits history.

Discussed above; review is done on a per-patch basis (*not* just the
diff between original and final states), and people shouldn't review
patches with known bugs in them.

> > >        mkdir $TEMP_DIR
> > >  fi
> > >
> > > diff --git a/scripts/prepare_regex_strings.py
> > > b/scripts/prepare_regex_strings.py
> > > index d2eac86..2f9300a 100644
> > > --- a/scripts/prepare_regex_strings.py
> > > +++ b/scripts/prepare_regex_strings.py
> > > @@ -36,6 +36,5 @@ if __name__ == '__main__':
> > >        lines = list(pd.Series(lines).value_counts().reset_index().app-
> > >        ly(lambda r: '{}::{}'.format(r['index'].strip(), -
> > >        np.log(r[0]/len(lines))), axis=1))
> > >
> > >        with open(FILE_NAME, 'w') as f:
> > > -             f.write('[?*]::1000000\n')
> > 
> > you don't mention why this is being removed; can't you just remove it
> > from the earlier patch?
> - OOV tokens are weighted in a different way now.

OK, I think this was merged into a commit whose subject didn't mention

this; it should be split into a separate commit (or, if it doesn't
make sense, don't include the old version and just merge this in with
the patch introducing the code)

> > 
> > > From 6caa6cd356732dcf3249522e67fe7c03ec2134f9 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Tue, 4 Jun 2019 23:03:01 +0200
> > > Subject: [PATCH 09/12] Fix the used character to replace a space

> > 
> > What was wrong with the old one? Maybe fix it earlier in the patch

> > 
> > series?
> - I thought using hfst's special token for a space was better.

OK, some explanation is definitely deserved in the commit message.
Probably better just to use this from the beginning, also.

> > >
> > > ---
> > >  scripts/lt-weight | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index 25f44dd..5e7561a 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > > @@ -25,7 +25,7 @@ DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"
> > >  # Convert the input FST to HFST
> > >  lt-print $FST |
> > >  sed -e 's/:/\\:/'|
> > > -sed -e :a -e 's/ /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST
> > > +sed -e :a -e 's/ /@_SPACE_@/;ta'> $ATTFST
> > >
> > >  hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
> > >
> > > --
> > > 2.21.0
> > >
> > > From 96322670d0d957f8b3fae476fe6b255247fe4f56 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg <amr_moha...@live.com>
> > > Date: Wed, 5 Jun 2019 00:25:40 +0200
> > > Subject: [PATCH 10/12] Minimize the FST to prevent segmenation faults in
> > >  lt-comp
> > 
> > why is lt-comp segfaulting? that shouldn't happen; can you describe

> > 
> > the problem / link a bug for lt-comp?
> > 
> - Large fsts. Didn't really investigate the size of the fsts back then.
> Just used minimize to optimize the fsts' sizes.

OK, this should be in the commit message, as well as a comment added
where you are minimising the fsts, so someone reading the code
understands why this is necessary.

And maybe open a bug on lt-comp, and link the bug report? Segfaults
shouldn't occur; ENOMEM is a reasonable crash, but segfault is bad...

signature.asc
Description: OpenPGP digital signature

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

Re: [Apertium-stuff] [PATCH] supervised weighting of automata

Reply via email to