GitHub user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/66
  
    #### Requirements
    
    ```sql
    drop temporary function if exists tokenize;
    create temporary function tokenize as 'hivemall.tools.text.TokenizeUDF';
    
    drop temporary function if exists is_stopword;
    create temporary function is_stopword as 'hivemall.tools.text.StopwordUDF';
    
    drop temporary function if exists feature;
    create temporary function feature as 'hivemall.ftvec.FeatureUDF';
    
    drop temporary function if exists lda;
    create temporary function lda as 'hivemall.lda.OnlineLDAUDTF';
    ```
    
    #### Sample query
    
    ```sql
    with features as (
      select
        docid,
        feature(word, count(word)) as f
      from (
        select 1 as docid, "Fruits and vegetables are healthy." as doc
        union all
        select 2 as docid, "I like apples, oranges, and avocados. I do not like the flu or colds." as doc
      ) t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word
      where
        not is_stopword(word)
      group by
        docid, word
    ),
    t as (
      select docid, collect_set(f) as words
      from features
      group by docid
    )
    select lda(words, "-topic 2 -iter 20") from t
    ;
    ```
    
    #### Result
    
    | topic | word | score |
    |:---:|:---:|:---:|
    | 0 | fruits | 0.33372128 |
    | 0 | vegetables | 0.33272517 |
    | 0 | healthy | 0.33246377 |
    | 0 | flu | 2.3617347E-4 |
    | 0 | apples | 2.1898883E-4 |
    | 0 | oranges | 1.8161473E-4 |
    | 0 | like | 1.7666373E-4 |
    | 0 | avocados | 1.726186E-4 |
    | 0 | colds | 1.037139E-4 |
    | 1 | colds | 0.16622013 |
    | 1 | avocados | 0.16618845 |
    | 1 | oranges | 0.1661859 |
    | 1 | like | 0.16618414 |
    | 1 | apples | 0.16616651 |
    | 1 | flu | 0.16615893 |
    | 1 | healthy | 0.0012059759 |
    | 1 | vegetables | 0.0010818697 |
    | 1 | fruits | 6.080827E-4 |
    
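    Clearly, topic 0 corresponds to the words of doc 1 (docid 1), and topic 1 to the words of doc 2 (docid 2): each topic puts nearly all of its weight on the words of one document.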
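    
    To make that reading of the table easier to check, here is a minimal sketch that ranks words within each topic. It does not call `lda` at all: the `topic_words` rows are copied by hand from a subset of the result table above, and the names `topic_words`, `ranked`, and `rnk` are only illustrative.
    
    ```sql
    -- Illustrative only: rows copied from the result table above (a subset),
    -- so this runs without the lda function being registered.
    with topic_words as (
      select 0 as topic, "fruits" as word, 0.33372128 as score
      union all
      select 0 as topic, "vegetables" as word, 0.33272517 as score
      union all
      select 0 as topic, "healthy" as word, 0.33246377 as score
      union all
      select 0 as topic, "flu" as word, 2.3617347E-4 as score
      union all
      select 1 as topic, "colds" as word, 0.16622013 as score
      union all
      select 1 as topic, "avocados" as word, 0.16618845 as score
      union all
      select 1 as topic, "oranges" as word, 0.1661859 as score
      union all
      select 1 as topic, "healthy" as word, 0.0012059759 as score
    ),
    ranked as (
      -- rank words inside each topic, highest score first
      select
        topic, word, score,
        row_number() over (partition by topic order by score desc) as rnk
      from topic_words
    )
    select topic, word, score, rnk
    from ranked
    where rnk <= 2
    order by topic, rnk
    ;
    ```
    
    With these rows, the top two words should be fruits and vegetables for topic 0, and colds and avocados for topic 1, which matches the doc 1 / doc 2 split described above.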
    
    @myui Could you review whether this interface is sufficient?
    
    From now on, I will carefully check whether the algorithm is implemented correctly in `OnlineLDAModel`.

