[PR] N-Gram statistics using lineage [systemds]

via GitHub Sat, 03 Aug 2024 05:12:49 -0700


Jaybit0 opened a new pull request, #2062:
URL: https://github.com/apache/systemds/pull/2062


   This PR extends PR #2045 and adds support for data-dependent n-grams by 
using and extending the existing lineage functionality. Because we are dealing 
with DAGs rather than linear sequences of instructions, I implemented the 
extension so that it tracks every instruction path of length `n`. For example, 
given the DAG `(a*b + c/d)` and bigrams, the two operation sequences 
`[(*, +), (/, +)]` would be added to the bigram store. I also keep track of the 
data types of each instruction, which is why I extended the existing lineage 
functionality: the `_data` string is sometimes empty and sometimes contains 
inconsistent information.
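
   The path tracking described above can be illustrated with a small 
standalone sketch. This is not the PR's Java code: `Node`, `paths_ending_at` 
and `collect_ngrams` are hypothetical names, and the sketch assumes that leaf 
operands such as `a` and `b` are not instructions themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical lineage node: an opcode plus the nodes producing its inputs."""
    opcode: str
    inputs: list = field(default_factory=list)


def paths_ending_at(node, n):
    """Return every opcode sequence of length n that ends at `node`."""
    if n == 1:
        return [(node.opcode,)]
    result = []
    for producer in node.inputs:
        # Extend each shorter path that ends at a producer by this node's opcode.
        for prefix in paths_ending_at(producer, n - 1):
            result.append(prefix + (node.opcode,))
    return result


def collect_ngrams(root, n):
    """Collect the n-grams ending at every node reachable from `root`."""
    seen, ngrams, stack = set(), [], [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # a shared sub-DAG is visited only once
        seen.add(id(node))
        ngrams.extend(paths_ending_at(node, n))
        stack.extend(node.inputs)
    return ngrams


# The DAG (a*b + c/d): '+' consumes the results of '*' and '/'.
mul = Node("*")
div = Node("/")
add = Node("+", [mul, div])

bigrams = collect_ngrams(add, 2)
print(bigrams)  # [('*', '+'), ('/', '+')]
```

   Because a node with several inputs contributes one path per input, the 
number of recorded n-grams can grow quickly with `n` on wide DAGs.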
   
   The n-gram table now looks like this. The arguments within parentheses 
show the input parameters of an instruction (separated by `°`), and the suffix 
`[i]` gives the zero-based parameter index at which the result is consumed by 
the following instruction (e.g. for the first entry, the result of `rblk` is 
used as the second parameter of `ba+*`):
   ```
   Most common 2-grams (sorted by absolute time):
     #  N-Gram                          Time(s)  StdDev(t)/Mean(t)  Count
     1  (rblk·MATRIX·FP64(MATRIX·FP64)    1,144         (, 1.067)      4
        [1], ba+*·MATRIX·FP64(MATRIX·F                                  
        P64 ° MATRIX·FP64))                                             
     2  (rblk·MATRIX·FP64(MATRIX·FP64)    0,853         (, 0.469)      3
        [0], ba+*·MATRIX·FP64(MATRIX·F                                  
        P64 ° MATRIX·FP64))                                             
     3  (createvar·MATRIX·FP64()[0], r    0,343    (0.627, 0.929)      2
        blk·MATRIX·FP64(MATRIX·FP64))                                   
     4  (rblk·MATRIX·FP64(MATRIX·FP64)    0,285                 -      1
        [0], cpvar·MATRIX·FP64(MATRIX·                                  
        FP64))                                                          
     5  (+*·MATRIX·FP64(MATRIX·FP64 °     0,153                 -      1
        SCALAR·FP64 ° MATRIX·FP64)[0],                                  
         write·MATRIX·FP64(MATRIX·FP64                                  
         ° L_SCALAR·STRING ° L_SCALAR·                                  
        STRING ° L_SCALAR·INT64))                                       
   ```
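
   As a reading aid for the wrapped `N-Gram` column, here is a minimal sketch 
of how such an entry could be assembled. It is inferred from the table above 
only; `format_instruction` and `format_ngram` are hypothetical helpers and not 
the PR's implementation.

```python
def format_instruction(opcode, out_type, in_types):
    """Render one instruction as opcode·OUT_TYPE(IN_TYPE ° IN_TYPE ...)."""
    return f"{opcode}·{out_type}({' ° '.join(in_types)})"


def format_ngram(steps):
    """Render an n-gram from (opcode, out_type, in_types, next_index) tuples.

    `next_index` is the parameter position at which the following instruction
    consumes this result; it is None for the last instruction.
    """
    parts = []
    for opcode, out_type, in_types, next_index in steps:
        text = format_instruction(opcode, out_type, in_types)
        if next_index is not None:
            text += f"[{next_index}]"
        parts.append(text)
    return "(" + ", ".join(parts) + ")"


# First table entry: the result of rblk feeds parameter 1 of ba+*.
entry = format_ngram([
    ("rblk", "MATRIX·FP64", ["MATRIX·FP64"], 1),
    ("ba+*", "MATRIX·FP64", ["MATRIX·FP64", "MATRIX·FP64"], None),
])
print(entry)
```

   Joining the three wrapped lines of the first table row yields the same 
string as `entry`.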
   @mboehm7 


