Package: presage
Version: 0.8.8-1ubuntu3
Severity: normal
Tags: upstream patch

Dear Maintainer,

Currently presage splits on apostrophes and is unable to represent words
containing apostrophes in the database (due to them not being escaped). This
results in presage being unable to correctly predict words like "don't".

Viewing the database for the English predictions shows that this is being
represented as a 2-gram of: "don" and "t".

The expected result is that this would be represented as a 1-gram of: "don't".

I realise you're also the upstream developer, so you will have seen my
upstream bug report for this already. Basically we're planning on including
this patch temporarily in the Ubuntu presage package and were wondering if
you'd be interested in including it in the Debian package until an upstream
solution is ready. If so, we can then just sync our Ubuntu package with your
Debian package. Otherwise, if you'd rather wait until you've got a more
comprehensive upstream solution, we'll just apply the patch temporarily in
Ubuntu and then sync once it's fixed upstream.

Thanks!

P.S. I'm still getting to grips with Debian packaging procedures, so
apologies if I've misstepped anywhere!

-- System Information:
Debian Release: jessie/sid
  APT prefers utopic-updates
  APT policy: (500, 'utopic-updates'), (500, 'utopic-security'), (500, 'utopic'), (100, 'utopic-backports')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.16.0-24-generic (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages presage depends on:
ii  libc6         2.19-10ubuntu2
ii  libgcc1       1:4.9.1-16ubuntu6
ii  libncurses5   5.9+20140712-2ubuntu1
ii  libpresage1   0.8.8-1ubuntu3
ii  libsqlite3-0  3.8.6-1
ii  libstdc++6    4.9.1-16ubuntu6
ii  libtinfo5     5.9+20140712-2ubuntu1

presage recommends no packages.

presage suggests no packages.

-- no debconf information

Description: Allow words with apostrophes to be predicted
 Stop the tokenizer from splitting based on apostrophes and allow for
 the escaping of words containing apostrophes in the database connector.
Author: Michael Sheldon <[email protected]>
Forwarded: https://sourceforge.net/p/presage/patches/2/
Bug-Ubuntu: https://launchpad.net/bugs/1384800

--- presage-0.9.orig/src/lib/core/charsets.h
+++ presage-0.9/src/lib/core/charsets.h
@@ -180,7 +180,6 @@ const char DEFAULT_SEPARATOR_CHARS[]={
     '$',
     '%',
     '&',
-    '\'',
     '(',
     ')',
     '*',
--- presage-0.9.orig/src/lib/predictors/dbconnector/databaseConnector.cpp
+++ presage-0.9/src/lib/predictors/dbconnector/databaseConnector.cpp
@@ -30,6 +30,7 @@
 #include <sstream>
 #include <stdlib.h>
 #include <assert.h>
+#include <boost/algorithm/string/replace.hpp>
 
 DatabaseConnector::DatabaseConnector(const std::string database_name,
 				     const size_t cardinality,
@@ -293,12 +294,8 @@ std::string DatabaseConnector::buildValu
 
 std::string DatabaseConnector::sanitizeString(const std::string str) const
 {
-    // TODO
-    // just return the string for the time being
-    // REVISIT
-    // TO BE DONE
-    // TBD
-    return str;
+    // Escape single quotes
+    return boost::replace_all_copy(str, "'", "''");
 }
 
 int DatabaseConnector::extractFirstInteger(const NgramTable& table) const
--- presage-0.9.orig/src/tools/text2ngram.cpp
+++ presage-0.9/src/tools/text2ngram.cpp
@@ -174,7 +174,7 @@ int main(int argc, char* argv[])
 	std::ifstream infile(argv[i]);
 	ForwardTokenizer tokenizer(infile,
 				   " \f\n\r\t\v",
-				   "`~!@#$%^&*()_-+=\\|]}[{'\";:/?.>,<");
+				   "`~!@#$%^&*()_-+=\\|]}[{\";:/?.>,<");
 	tokenizer.lowercaseMode(lowercase);
 
 	// take care of first N-1 tokens
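For anyone reviewing the patch without a presage build to hand, the two behaviours it changes can be illustrated with a small standalone sketch. This is not presage code: `escapeSingleQuotes` and `splitTokens` are hypothetical helpers written for this illustration, and the escaping uses only the standard library rather than the `boost::replace_all_copy` call in the patch. It assumes the escaped value ends up inside a single-quoted SQL string literal, where a doubled quote stands for one literal quote.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of presage): doubles every single quote so
// the word can be placed inside a single-quoted SQL string literal.
std::string escapeSingleQuotes(const std::string& str)
{
    std::string escaped;
    escaped.reserve(str.size());
    for (char c : str) {
        escaped += c;
        if (c == '\'') {
            escaped += '\'';  // "don't" becomes "don''t"
        }
    }
    return escaped;
}

// Hypothetical helper (not presage's ForwardTokenizer): splits text on any
// character found in `separators` and drops empty tokens.
std::vector<std::string> splitTokens(const std::string& text,
                                     const std::string& separators)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (separators.find(c) != std::string::npos) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

With the apostrophe in the separator set, `splitTokens("i don't know", " '")` yields "i", "don", "t" and "know", which matches the "don" / "t" split seen in the database. With only the space as a separator, "don't" survives as one token, and `escapeSingleQuotes` turns it into `don''t` before it is used in a query.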

