[GitHub] [opennlp] autayeu commented on issue #329: OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence…

GitHub Thu, 04 Oct 2018 02:43:54 -0700

Let me follow your lead and let's take another look.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.sentdetect.lang.Factory;


class Scratch {

  private static final int ITERATIONS = 100_000;
  private static final char[] EN_SAMPLE = ("I think you are better off sending an email to the solr-user mailing " +
      "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
      "more about your use case so we can understand what leads up to the dump. Most likely you " +
      "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
      "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
  private static final char[] PT_SAMPLE = ("A pré-história de Portugal é partilhada com a do resto da Península Ibérica. Os vestígios humanos modernos mais antigos conhecidos são de homens de Cro-Magnon com \"traços\" de Neanderthal, com 24 500 anos e que são interpretados como indicadores de extensas populações mestiças entre as duas espécies. São também os vestígios mais recentes de seres com caraterísticas de Neandertal que se conhece, provavelmente os últimos da sua espécie[28] Há cerca de 5500 a.C., surge uma cultura mesolítica.[29] Durante o Neolítico a região foi ocupada por pré-celtas e celtas, dando origem a povos como os galaicos, lusitanos e cinetes, e visitada por fenícios[30] e cartagineses. Os romanos incorporaram-na no seu Império como Lusitânia [31] (centro e sul de Portugal), após vencida a resistência onde se destacou Viriato.[29]\n"
      + "\n"
      + "No século III, foi criada a Galécia, a norte do Douro, a partir da Tarraconense, abrangendo o norte de Portugal. A romanização marcou a cultura, em especial a língua latina, que foi a base do desenvolvimento da língua portuguesa.[32]\n"
      + "\n"
      + "Com o enfraquecimento do império romano, a partir de 409, o território é ocupado por povos germânicos como vândalos na Bética, alanos que fixaram-se na Lusitânia e suevos na Galécia. Em 415 os visigodos entram na Península, a pedido dos romanos, para expulsar os invasores. Vândalos e alanos deslocam-se para o norte de África. Os suevos e visigodos fundam os primeiros reinos cristãos. Em 711 o território é conquistado pelos mouros que aí estabeleceram o Al-Andalus. Os cristãos recolhem-se para norte, acantonados no Reino das Astúrias. Em 868, durante a Reconquista, foi formado o Condado Portucalense.[33] ").toCharArray();
  private static final char[] JP_SAMPLE = ("「にっぽん」、「にほん」と読まれる。どちらも多く用いられているため、日本政府は正式な読み方をどちらか一方には定めておらず、どちらの読みでも良いとしている[5]。\n"
      + "\n"
      + "7世紀の後半の国際関係から生じた「日本」国号は、当時の国際的な読み（音読）で「ニッポン」（呉音）ないし「ジッポン」（漢音）と読まれたものと推測される[6]。いつ「ニホン」の読みが始まったか定かでない。仮名表記では「にほん」と表記された。平安時代には「ひのもと」とも和訓されるようになった。\n"
      + "\n"
      + "室町時代の謡曲・狂言は、中国人に「ニッポン」と読ませ、日本人に「ニホン」と読ませている。安土桃山時代にポルトガル人が編纂した『日葡辞書』や『日本小文典』等には、「ニッポン」「ニホン」「ジッポン」の読みが見られ、その用例から判断すると、改まった場面・強調したい場合に「ニッポン」が使われ、日常の場面で「ニホン」が使われていた[7]。このことから小池清治は、中世の日本人が中国語的な語感のある「ジッポン」を使用したのは、中国人・西洋人など対外的な場面に限定されていて、日常だと「ニッポン」「ニホン」が用いられていたのでは、と推測している[8]。なお、現在に伝わっていない「ジッポン」音については、その他の言語も参照。 ").toCharArray();
  private static final char[] TH_SAMPLE = ("สืบจากปัญหาเศรษฐกิจรุนแรงจากภาวะเศรษฐกิจตกต่ำครั้งใหญ่ และราคาข้าวตกลงอย่างรุนแรง นอกจากนี้ยังมีการลดรายจ่ายภาครัฐอย่างมากทำให้เกิดความไม่พอใจในหมู่อภิชน[19]:25 วันที่ 24 มิถุนายน 2475 คณะราษฎรนำปฏิวัติเปลี่ยนแปลงการปกครองจากสมบูรณาญาสิทธิราชย์มาเป็นระบอบประชาธิปไตย ทำให้คณะราษฎรเข้ามามีบทบาททางการเมือง ปลายปี 2476 เกิดกบฏบวรเดช ซึ่งหวังเปลี่ยนแปลงการปกครองกลับสู่สมบูรณาญาสิทธิราช แต่ล้มเหลว[28]:446–8 ปี 2477 พระบาทสมเด็จพระปกเกล้าเจ้าอยู่หัวมีความเห็นไม่ลงรอยกับรัฐบาลจึงทรงสละราชสมบัติในปี 2478 สภาผู้แทนราษฎรเลือกพระวรวงศ์เธอ พระองค์เจ้าอานันทมหิดลเป็นพระมหากษัตริย์ ซึ่งขณะนั้นทรงศึกษาอยู่ในประเทศสวิสเซอร์แลนด์[28]:448–9\n"
      + "\n"
      + "เดือนธันวาคม 2481 พลตรีหลวงพิบูลสงครามได้เป็นนายกรัฐมนตรี เขาปราบปรามศัตรูทางการเมืองรวมทั้งต่อต้านราชวงศ์อย่างเปิดเผย[28]:457 รัฐบาลมีแนวคิดชาตินิยมและปรับให้เป็นตะวันตก และเริ่มดำเนินนโยบายต่อต้านจีนและฝรั่งเศส[19]:28 วันที่ 23 มิถุนายน 2482 มีการเปลี่ยนชื่อประเทศจาก \"สยาม\" มาเป็น \"ไทย\" ในปี 2484 เกิดสงครามขนาดย่อมขึ้นระหว่างวิชีฝรั่งเศสกับไทย ทำให้ไทยได้ดินแดนเพิ่มจากลาวและกัมพูชาช่วงสั้น ๆ[28]:462 วันที่ 8 ธันวาคม ปีเดียวกัน ประเทศญี่ปุ่นบุกครองไทย และรัฐบาลลงนามเป็นพันธมิตรทางทหารกับญี่ปุ่น และประกาศสงครามกับสหรัฐและสหราชอาณาจักร[28]:465 มีการตั้งขบวนการเสรีไทยขึ้นทั้งในและต่างประเทศเพื่อต่อต้านรัฐบาลและการยึดครองของญี่ปุ่น[28]:465–6 หลังสงครามยุติในปี 2488 ประเทศไทยลงนามความตกลงสมบูรณ์แบบเพื่อเลิกสถานะสงครามกับฝ่ายสัมพันธมิตร ").toCharArray();

  private static Set<Character> eosCharacters;
  private static char[] eosChars;

  private static void setEosCharacters(char[] chars) {
    eosChars = chars;
    eosCharacters = new HashSet<>();
    for (char eosChar: chars) {
      eosCharacters.add(eosChar);
    }
  }

  public static void main(String[] args) {
    System.out.println("Language\tarray (ms)\tset (ms)");
    
    testBuffer(EN_SAMPLE, "English", Factory.defaultEosCharacters);

    testBuffer(PT_SAMPLE, "Portuguese", Factory.ptEosCharacters);

    testBuffer(JP_SAMPLE, "Japanese", Factory.jpnEosCharacters);

    testBuffer(TH_SAMPLE, "Thai", Factory.thEosCharacters);
  }

  private static void testBuffer(final char[] cbuf, final String langName, final char[] eosChars) {
    setEosCharacters(eosChars);

    long arrayDuration;
    long setDuration;
    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsArray(cbuf);
      }
      arrayDuration = System.currentTimeMillis() - start;
    }

    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsHashset(cbuf);
      }
      setDuration = System.currentTimeMillis() - start;
    }

    System.out.println(langName + "\t" + arrayDuration + "\t" + setDuration);
  }

  private static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosChars) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

  private static List<Integer> getPositionsHashset(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eosCharacters.contains(cbuf[i])) {
        l.add(i);
      }
    }
    return l;
  }
  
}
```

| Language   | array (ms) | set (ms) |
|------------|------------|----------|
| English    | 156        | 217      |
| Portuguese | 1071       | 819      |
| Japanese   | 515        | 1249     |
| Thai       | 1324       | 2917     |
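
The gap between array and set differs a lot per language, and one plausible factor is the number of eos characters per language (an assumption; the arrays in `Factory` were not inspected for this sketch). A rough way to look for a crossover is to time both lookups on synthetic eos sets of growing size. Everything below is hypothetical: `CrossoverProbe`, `syntheticEos`, and the sample text are made up for illustration and are not part of OpenNLP. It shares the limitations of the benchmark above (a single timed pass, no control over JIT warm-up), so the output is indicative at best.

```java
import java.util.HashSet;
import java.util.Set;

class CrossoverProbe {

  // Linear scan over the eos array, mirroring getPositionsArray above.
  static boolean inArray(char[] eos, char c) {
    for (char e : eos) {
      if (e == c) {
        return true;
      }
    }
    return false;
  }

  // Hypothetical eos set: `size` consecutive characters starting at '!'.
  // This is NOT a real language's eos set, only a way to vary the size.
  static char[] syntheticEos(int size) {
    char[] eos = new char[size];
    for (int i = 0; i < size; i++) {
      eos[i] = (char) ('!' + i);
    }
    return eos;
  }

  static Set<Character> toSet(char[] eos) {
    Set<Character> set = new HashSet<>();
    for (char c : eos) {
      set.add(c);
    }
    return set;
  }

  public static void main(String[] args) {
    // Mostly letters and spaces, so most lookups are misses (full array scan).
    char[] text = "Most characters in running text are not sentence ends. Are they? No!"
        .toCharArray();
    int iterations = 200_000;

    System.out.println("size\tarray (ms)\tset (ms)\thits");
    for (int size = 1; size <= 32; size *= 2) {
      char[] eos = syntheticEos(size);
      Set<Character> set = toSet(eos);
      long hits = 0; // accumulated and printed so the loops are not optimized away

      long start = System.nanoTime();
      for (int n = 0; n < iterations; n++) {
        for (char c : text) {
          if (inArray(eos, c)) {
            hits++;
          }
        }
      }
      long arrayMs = (System.nanoTime() - start) / 1_000_000;

      start = System.nanoTime();
      for (int n = 0; n < iterations; n++) {
        for (char c : text) {
          if (set.contains(c)) { // boxes each char into a Character
            hits++;
          }
        }
      }
      long setMs = (System.nanoTime() - start) / 1_000_000;

      System.out.println(size + "\t" + arrayMs + "\t" + setMs + "\t" + hits);
    }
  }
}
```

For numbers worth basing a decision on, a harness such as JMH would probably be a better fit than hand-rolled timing loops.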


Perhaps the following questions are relevant here:
* At what size does a HashSet become faster than an array for lookups? Fundamentally, an array lookup involves only comparisons, while a HashSet lookup involves a hash calculation, generic overheads (such as boxing each `char` into a `Character`), and a comparison.
* What is the average eos character count across languages?
* What is the distribution of eos character counts across languages?
* For how many languages is the boundary from the first question crossed?
* What is the frequency of each eos character in the target language?
* What is the distribution of eos characters in the target language?
* If the eos character count justifies the switch to HashSet (see the first question), does the shape of the eos character distribution still back up the switch? Power-law distributions might be common in language, and the fat head of such a distribution might still make array lookups faster than HashSet lookups.
* Which language is processed by OpenNLP the most? Or rather:
* What is the distribution of languages across OpenNLP users?
* Based on the answers to the above, which change, at what level, would be the most beneficial for most of our users? Would a change at the language level be better than a global one for all languages? For example, there is an `opennlp.tools.sentdetect.lang.th.SentenceContextGenerator` which sets its own eosCharacters. Just as well, there could be a Portuguese version of the sentence scanner which uses a HashMap for lookups. By the way, there is a duplicate of the eosCharacters constant in the Thai `SentenceContextGenerator`, and an override of a deprecated method.
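
To make the frequency questions concrete, here is a minimal sketch that counts how often each eos character occurs in a sample and orders the array so the most frequent one is compared first. `EosFrequency`, `count`, and `orderByFrequency` are hypothetical names for illustration only; nothing like this exists in OpenNLP as far as this comment is concerned, and the sample string is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EosFrequency {

  // Counts occurrences of each candidate eos character in the text.
  // Characters that never occur are kept with a count of 0.
  static Map<Character, Integer> count(char[] text, char[] eosChars) {
    Map<Character, Integer> counts = new LinkedHashMap<>();
    for (char eos : eosChars) {
      counts.put(eos, 0);
    }
    for (char c : text) {
      Integer seen = counts.get(c);
      if (seen != null) {
        counts.put(c, seen + 1);
      }
    }
    return counts;
  }

  // Returns the eos characters ordered from most to least frequent in the
  // sample; ties keep their original relative order (List.sort is stable).
  static char[] orderByFrequency(char[] text, char[] eosChars) {
    Map<Character, Integer> counts = count(text, eosChars);
    List<Character> ordered = new ArrayList<>(counts.keySet());
    ordered.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
    char[] result = new char[ordered.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = ordered.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    char[] sample = "One. Two. Three! Four?".toCharArray(); // made-up sample
    char[] eos = {'!', '?', '.'};                            // made-up eos set
    System.out.println(count(sample, eos));
    System.out.println(new String(orderByFrequency(sample, eos)));
  }
}
```

Note that reordering only shortens lookups that hit. Most characters in running text are not eos characters, and every miss still scans the whole array, so the benefit of a fat head may be smaller than it first appears.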

What do you think?

[ Full content available at: https://github.com/apache/opennlp/pull/329 ]
This message was relayed via gitbox.apache.org for [email protected]
