jimmylao edited a comment on issue #52:
URL: 
https://github.com/apache/incubator-bluemarlin/issues/52#issuecomment-1058102582


   @Bimlesh759-AI Thanks for sharing.
   As shown in the table below, the negative-to-positive sample ratios are 
1.07 in the training data and 2.66 in the testing data. These are balanced 
(good) ratios.
   
     | positive # | negative # | neg/pos ratio | % of all samples
   -- | -- | -- | -- | --
   train | 217,949,105 | 233,029,477 | 1.069192172 | 87.16%
   test | 18,131,220 | 48,307,204 | 2.66431073 | 12.84%
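
For reference, the ratio and percentage columns can be reproduced from the raw counts. This is only a sketch: the dictionary layout and the `summarize` helper are made up for illustration, and it assumes the ratio column is negative count divided by positive count and the percentage column is each split's share of all samples (both assumptions match the numbers in the table).

```python
# Counts copied from the table above; the structure itself is hypothetical.
COUNTS = {
    "train": {"positive": 217_949_105, "negative": 233_029_477},
    "test": {"positive": 18_131_220, "negative": 48_307_204},
}


def summarize(counts):
    """Return the neg/pos ratio and the share of all samples for each split."""
    grand_total = sum(c["positive"] + c["negative"] for c in counts.values())
    summary = {}
    for split, c in counts.items():
        split_total = c["positive"] + c["negative"]
        summary[split] = {
            "neg_pos_ratio": c["negative"] / c["positive"],
            "share": split_total / grand_total,
        }
    return summary


if __name__ == "__main__":
    for split, stats in summarize(COUNTS).items():
        print(f"{split}: ratio={stats['neg_pos_ratio']:.4f}, share={stats['share']:.2%}")
```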
   
   
   One approach you may want to try: keep all testing samples, randomly drop 
50% of the training samples, then train a model and check whether the AUC on 
the testing samples drops significantly. (Since testing samples make up less 
than 13% of the data, they contribute much less to the training time.)
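
A rough sketch of that experiment in plain Python, in case it helps. Everything here is illustrative: `subsample`, `auc`, `auc_drop`, and the `train_fn` callback are hypothetical names, rows are assumed to be `(features, label)` pairs with 0/1 labels, and at the real data volume the same idea would need a distributed implementation rather than in-memory lists.

```python
import random


def subsample(rows, keep_fraction=0.5, seed=0):
    """Keep each row independently with probability keep_fraction."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < keep_fraction]


def auc(labels, scores):
    """ROC AUC via the rank-sum formulation; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_drop(train_rows, test_rows, train_fn, keep_fraction=0.5, seed=0):
    """Train on all rows and on a random subset; compare AUC on the full test set.

    train_fn(rows) must return a function mapping features to a score.
    """
    test_labels = [y for _, y in test_rows]
    full_model = train_fn(train_rows)
    reduced_model = train_fn(subsample(train_rows, keep_fraction, seed))
    auc_full = auc(test_labels, [full_model(x) for x, _ in test_rows])
    auc_reduced = auc(test_labels, [reduced_model(x) for x, _ in test_rows])
    return auc_full, auc_reduced, auc_full - auc_reduced
```

Repeating the comparison with a few different seeds would help separate a real AUC drop from sampling noise.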
   
   Your per-keyword sample breakdown looks good too. The following curves show 
the accumulated percentage of training and testing samples across all 
keywords. They show that the keyword distributions of the training and testing 
samples agree well.
   
![image](https://user-images.githubusercontent.com/60371672/156586663-ec2b7fcd-7a20-4e33-afb0-1c2eb7d343b8.png)
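
A minimal sketch of how curves like these could be computed and compared, assuming per-keyword sample counts are available as dictionaries. The keyword ordering (by descending training count) and the "largest gap between the two curves" summary are assumptions chosen for illustration, not necessarily how the plot above was produced; the function names are hypothetical.

```python
from itertools import accumulate


def cumulative_share(counts, keyword_order):
    """Accumulated percentage of samples, walking keywords in the given order."""
    total = sum(counts.get(k, 0) for k in keyword_order)
    if total == 0:
        raise ValueError("no samples for the given keywords")
    running = accumulate(counts.get(k, 0) for k in keyword_order)
    return [100.0 * c / total for c in running]


def max_gap(train_counts, test_counts):
    """Largest vertical distance (in percentage points) between the two curves."""
    # Use one shared keyword order so the curves are comparable point by point.
    order = sorted(
        set(train_counts) | set(test_counts),
        key=lambda k: (-train_counts.get(k, 0), k),
    )
    train_curve = cumulative_share(train_counts, order)
    test_curve = cumulative_share(test_counts, order)
    return max(abs(a - b) for a, b in zip(train_curve, test_curve))
```

A small gap means the keyword mix is similar in both splits; a large gap at some keyword points to where the training and testing distributions diverge.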
   

