[jira] [Commented] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner

ASF GitHub Bot (JIRA) Thu, 04 Oct 2018 02:44:56 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637991#comment-16637991
 ]


ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------

autayeu commented on issue #329: OPENNLP-1214: use hash to avoid linear search 
in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426953820
 
 
   Let me follow your lead and let's take another look.
   
   ```java
   import java.util.ArrayList;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;
   import opennlp.tools.sentdetect.lang.Factory;
   
   class Scratch {
   
     private static final int ITERATIONS = 100_000;
     private static final char[] EN_SAMPLE = ("I think you are better off sending an email to the solr-user mailing " +
         "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
         "more about your use case so we can understand what leads up to the dump. Most likely you " +
         "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
         "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
     private static final char[] PT_SAMPLE = ("A pré-história de Portugal é "
         + "partilhada com a do resto da Península Ibérica. Os vestígios humanos modernos "
         + "mais antigos conhecidos são de homens de Cro-Magnon com \"traços\" de "
         + "Neanderthal, com 24 500 anos e que são interpretados como indicadores de "
         + "extensas populações mestiças entre as duas espécies. São também os vestígios "
         + "mais recentes de seres com caraterísticas de Neandertal que se conhece, "
         + "provavelmente os últimos da sua espécie[28] Há cerca de 5500 a.C., surge uma "
         + "cultura mesolítica.[29] Durante o Neolítico a região foi ocupada por pré-celtas "
         + "e celtas, dando origem a povos como os galaicos, lusitanos e cinetes, e "
         + "visitada por fenícios[30] e cartagineses. Os romanos incorporaram-na no seu "
         + "Império como Lusitânia [31] (centro e sul de Portugal), após vencida a "
         + "resistência onde se destacou Viriato.[29]\n"
         + "\n"
         + "No século III, foi criada a Galécia, a norte do Douro, a partir da "
         + "Tarraconense, abrangendo o norte de Portugal. A romanização marcou a cultura, "
         + "em especial a língua latina, que foi a base do desenvolvimento da língua "
         + "portuguesa.[32]\n"
         + "\n"
         + "Com o enfraquecimento do império romano, a partir de 409, o "
         + "território é ocupado por povos germânicos como vândalos na Bética, alanos que "
         + "fixaram-se na Lusitânia e suevos na Galécia. Em 415 os visigodos entram na "
         + "Península, a pedido dos romanos, para expulsar os invasores. Vândalos e alanos "
         + "deslocam-se para o norte de África. Os suevos e visigodos fundam os primeiros "
         + "reinos cristãos. Em 711 o território é conquistado pelos mouros que aí "
         + "estabeleceram o Al-Andalus. Os cristãos recolhem-se para norte, acantonados no "
         + "Reino das Astúrias. Em 868, durante a Reconquista, foi formado o Condado "
         + "Portucalense.[33] ").toCharArray();
     private static final char[] JP_SAMPLE = 
("「にっぽん」、「にほん」と読まれる。どちらも多く用いられているため、日本政府は正式な読み方をどちらか一方には定めておらず、どちらの読みでも良いとしている[5]。\n"
         + "\n"
         + 
"7世紀の後半の国際関係から生じた「日本」国号は、当時の国際的な読み（音読）で「ニッポン」（呉音）ないし「ジッポン」（漢音）と読まれたものと推測される[6]。いつ「ニホン」の読みが始まったか定かでない。仮名表記では「にほん」と表記された。平安時代には「ひのもと」とも和訓されるようになった。\n"
         + "\n"
         + 
"室町時代の謡曲・狂言は、中国人に「ニッポン」と読ませ、日本人に「ニホン」と読ませている。安土桃山時代にポルトガル人が編纂した『日葡辞書』や『日本小文典』等には、「ニッポン」「ニホン」「ジッポン」の読みが見られ、その用例から判断すると、改まった場面・強調したい場合に「ニッポン」が使われ、日常の場面で「ニホン」が使われていた[7]。このことから小池清治は、中世の日本人が中国語的な語感のある「ジッポン」を使用したのは、中国人・西洋人など対外的な場面に限定されていて、日常だと「ニッポン」「ニホン」が用いられていたのでは、と推測している[8]。なお、現在に伝わっていない「ジッポン」音については、その他の言語も参照。 ").toCharArray();
     private static final char[] TH_SAMPLE =
         ("สืบจากปัญหาเศรษฐกิจรุนแรงจากภาวะเศรษฐกิจตกต่ำครั้งใหญ่ "
         + "และราคาข้าวตกลงอย่างรุนแรง "
         + "นอกจากนี้ยังมีการลดรายจ่ายภาครัฐอย่างมากทำให้เกิดความไม่พอใจในหมู่อภิชน[19]:25 "
         + "วันที่ 24 มิถุนายน 2475 "
         + "คณะราษฎรนำปฏิวัติเปลี่ยนแปลงการปกครองจากสมบูรณาญาสิทธิราชย์มาเป็นระบอบประชาธิปไตย"
         + " ทำให้คณะราษฎรเข้ามามีบทบาททางการเมือง ปลายปี 2476 เกิดกบฏบวรเดช "
         + "ซึ่งหวังเปลี่ยนแปลงการปกครองกลับสู่สมบูรณาญาสิทธิราช แต่ล้มเหลว[28]:446–8 ปี "
         + "2477 "
         + "พระบาทสมเด็จพระปกเกล้าเจ้าอยู่หัวมีความเห็นไม่ลงรอยกับรัฐบาลจึงทรงสละราชสมบัติในปี"
         + " 2478 สภาผู้แทนราษฎรเลือกพระวรวงศ์เธอ พระองค์เจ้าอานันทมหิดลเป็นพระมหากษัตริย์ "
         + "ซึ่งขณะนั้นทรงศึกษาอยู่ในประเทศสวิสเซอร์แลนด์[28]:448–9\n"
         + "\n"
         + "เดือนธันวาคม 2481 พลตรีหลวงพิบูลสงครามได้เป็นนายกรัฐมนตรี "
         + "เขาปราบปรามศัตรูทางการเมืองรวมทั้งต่อต้านราชวงศ์อย่างเปิดเผย[28]:457 "
         + "รัฐบาลมีแนวคิดชาตินิยมและปรับให้เป็นตะวันตก "
         + "และเริ่มดำเนินนโยบายต่อต้านจีนและฝรั่งเศส[19]:28 วันที่ 23 มิถุนายน 2482 "
         + "มีการเปลี่ยนชื่อประเทศจาก \"สยาม\" มาเป็น \"ไทย\" ในปี 2484 "
         + "เกิดสงครามขนาดย่อมขึ้นระหว่างวิชีฝรั่งเศสกับไทย "
         + "ทำให้ไทยได้ดินแดนเพิ่มจากลาวและกัมพูชาช่วงสั้น ๆ[28]:462 วันที่ 8 ธันวาคม "
         + "ปีเดียวกัน ประเทศญี่ปุ่นบุกครองไทย และรัฐบาลลงนามเป็นพันธมิตรทางทหารกับญี่ปุ่น "
         + "และประกาศสงครามกับสหรัฐและสหราชอาณาจักร[28]:465 "
         + "มีการตั้งขบวนการเสรีไทยขึ้นทั้งในและต่างประเทศเพื่อต่อต้านรัฐบาลและการยึดครองของญี่ปุ่น[28]:465–6"
         + " หลังสงครามยุติในปี 2488 "
         + "ประเทศไทยลงนามความตกลงสมบูรณ์แบบเพื่อเลิกสถานะสงครามกับฝ่ายสัมพันธมิตร "
         + "").toCharArray();
   
     private static Set<Character> eosCharacters;
     private static char[] eosChars;
   
     private static void setEosCharacters(char[] chars) {
       eosChars = chars;
       eosCharacters = new HashSet<>();
       for (char eosChar: chars) {
         eosCharacters.add(eosChar);
       }
     }
   
     public static void main(String[] args) {
       System.out.println("Language\tarray (ms)\tset (ms)");
       
       testBuffer(EN_SAMPLE, "English", Factory.defaultEosCharacters);
   
       testBuffer(PT_SAMPLE, "Portuguese", Factory.ptEosCharacters);
   
       testBuffer(JP_SAMPLE, "Japanese", Factory.jpnEosCharacters);
   
       testBuffer(TH_SAMPLE, "Thai", Factory.thEosCharacters);
     }
   
     private static void testBuffer(final char[] cbuf, final String langName,
         final char[] eosChars) {
       setEosCharacters(eosChars);
   
       long arrayDuration;
       long setDuration;
       {
         long start = System.currentTimeMillis();
         for (int n = 0; n < ITERATIONS; n++) {
           getPositionsArray(cbuf);
         }
         arrayDuration = System.currentTimeMillis() - start;
       }
   
       {
         long start = System.currentTimeMillis();
         for (int n = 0; n < ITERATIONS; n++) {
           getPositionsHashset(cbuf);
         }
         setDuration = System.currentTimeMillis() - start;
       }
   
       System.out.println(langName + "\t" + arrayDuration + "\t" + setDuration);
     }
   
     private static List<Integer> getPositionsArray(char[] cbuf) {
       List<Integer> l = new ArrayList<>();
       for (int i = 0; i < cbuf.length; i++) {
         for (char eosCharacter : eosChars) {
           if (cbuf[i] == eosCharacter) {
             l.add(i);
             break;
           }
         }
       }
       return l;
     }
   
     private static List<Integer> getPositionsHashset(char[] cbuf) {
       List<Integer> l = new ArrayList<>();
       for (int i = 0; i < cbuf.length; i++) {
         if (eosCharacters.contains(cbuf[i])) {
           l.add(i);
         }
       }
       return l;
     }
     
   }
   ```
   
   | Language   | array (ms) | set (ms) |
   |------------|------------|----------|
   | English    | 156        | 217      |
   | Portuguese | 1071       | 819      |
   | Japanese   | 515        | 1249     |
   | Thai       | 1324       | 2917     |
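   
   Read as ratios, the numbers above are easier to compare. The small sketch
   below only divides the set column by the array column, using the timings
   copied from the table; the class and method names are made up for
   illustration. It gives roughly 1.39 for English, 0.76 for Portuguese, 2.43
   for Japanese and 2.20 for Thai, so in this run the HashSet was faster only
   for Portuguese.
   
   ```java
   import java.util.Locale;
   
   class RatioSketch {
   
     // Language names and timings in ms copied from the table above: {array, set}.
     static final String[] LANGS = {"English", "Portuguese", "Japanese", "Thai"};
     static final long[][] MS = {{156, 217}, {1071, 819}, {515, 1249}, {1324, 2917}};
   
     // A ratio above 1.0 means the HashSet variant was slower than the array variant.
     static double ratio(long arrayMs, long setMs) {
       return (double) setMs / arrayMs;
     }
   
     public static void main(String[] args) {
       System.out.println("Language\tset/array");
       for (int i = 0; i < LANGS.length; i++) {
         String r = String.format(Locale.ROOT, "%.2f", ratio(MS[i][0], MS[i][1]));
         System.out.println(LANGS[i] + "\t" + r);
       }
     }
   }
   ```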
   
   
   Perhaps the following questions are relevant here:
   * At what size does a HashSet become faster than an array for lookups?
     Fundamentally, an array lookup involves only comparisons, while a HashSet
     lookup involves hash calculation, generic overheads and comparison.
   * What's the average eos character count across languages?
   * What's the distribution of eos character counts across languages?
   * For how many languages is the boundary mentioned in the first question
     crossed?
   * What's the frequency of each eos character in the target language?
   * What's the distribution of eos characters in the target language?
   * If the eos character count justifies the switch to HashSet (see the first
     question), does the shape of the eos character distribution still back up
     the switch? Power-law distributions might be common in language, and the
     fat head of such a distribution might still make array lookups faster
     than HashSet.
   * Which language is processed by OpenNLP the most? Or rather:
   * What's the distribution of languages across OpenNLP users?
   * Based on the answers to the above, which change at which level would be
     the most beneficial for most of our users? Would a change at the language
     level be better than a global change for all languages? For example,
     there is an `opennlp.tools.sentdetect.lang.th.SentenceContextGenerator`
     which sets its own eosCharacters. Just as well, there could be a
     Portuguese version of the sentence scanner which uses a HashSet for
     lookups. By the way, the eosCharacters constant is duplicated in the Thai
     `SentenceContextGenerator`, which also overrides a deprecated method.
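   
   One way to probe the first question is a sketch like the one below. It is
   not OpenNLP code: `EosLookupSketch`, `syntheticEos` and the other names are
   hypothetical, the eos sets are synthetic, and the text is random. It times a
   linear scan, a HashSet and, as a third option that also avoids boxing, a
   binary search over a sorted `char[]`, for eos sets of 1 to 64 characters.
   Where the crossover falls will depend on the JVM and the data, so no
   expected timings are given.
   
   ```java
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Random;
   import java.util.Set;
   
   class EosLookupSketch {
   
     // Linear scan over a primitive array: comparisons only, no boxing.
     static int countLinear(char[] text, char[] eos) {
       int hits = 0;
       for (char c : text) {
         for (char e : eos) {
           if (c == e) {
             hits++;
             break;
           }
         }
       }
       return hits;
     }
   
     // HashSet lookup: each contains() call boxes the char and hashes it.
     static int countSet(char[] text, Set<Character> eos) {
       int hits = 0;
       for (char c : text) {
         if (eos.contains(c)) {
           hits++;
         }
       }
       return hits;
     }
   
     // Binary search over a sorted primitive array: O(log n) comparisons, no boxing.
     static int countSorted(char[] text, char[] sortedEos) {
       int hits = 0;
       for (char c : text) {
         if (Arrays.binarySearch(sortedEos, c) >= 0) {
           hits++;
         }
       }
       return hits;
     }
   
     // Hypothetical eos set of the given size (NOT taken from OpenNLP), ascending.
     static char[] syntheticEos(int size) {
       char[] eos = new char[size];
       for (int i = 0; i < size; i++) {
         eos[i] = (char) ('!' + 2 * i);
       }
       return eos;
     }
   
     static Set<Character> toSet(char[] chars) {
       Set<Character> set = new HashSet<>();
       for (char c : chars) {
         set.add(c);
       }
       return set;
     }
   
     public static void main(String[] args) {
       Random random = new Random(42);
       char[] text = new char[5_000];
       for (int i = 0; i < text.length; i++) {
         text[i] = (char) ('!' + random.nextInt(200));
       }
       long sink = 0; // printed at the end so the JIT cannot discard the loops
       System.out.println("size\tlinear (ms)\tset (ms)\tsorted (ms)");
       for (int size = 1; size <= 64; size *= 2) {
         char[] eos = syntheticEos(size);
         Set<Character> set = toSet(eos);
         long[] nanos = new long[3];
         for (int round = 0; round < 600; round++) {
           long t0 = System.nanoTime();
           sink += countLinear(text, eos);
           long t1 = System.nanoTime();
           sink += countSet(text, set);
           long t2 = System.nanoTime();
           sink += countSorted(text, eos);
           long t3 = System.nanoTime();
           if (round >= 100) { // the first 100 rounds are warm-up only
             nanos[0] += t1 - t0;
             nanos[1] += t2 - t1;
             nanos[2] += t3 - t2;
           }
         }
         System.out.println(size + "\t" + nanos[0] / 1_000_000 + "\t"
             + nanos[1] / 1_000_000 + "\t" + nanos[2] / 1_000_000);
       }
       System.out.println("checksum: " + sink);
     }
   }
   ```
   
   The three strategies are interleaved within each round and preceded by
   warm-up rounds, which is meant to reduce the effect of JIT compilation
   order on the comparison. A proper harness such as JMH would be a more
   reliable way to settle the question.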
   
   What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1214
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1214
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check whether each character in the sentence is one of the eos characters. I 
> think we'd better use a HashSet to hold eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate 
> getEndOfSentenceCharacters(), because it returns char[] and nothing in 
> OpenNLP calls it at present, and to add an equivalent method that returns a 
> Set<Character> of eos chars. The set cannot preserve the order of the eos 
> chars, but I don't think that will be a problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
