Hi, Thanks for the response.
No, I don’t have the ‘true’ value, and that’s where I found it challenging. I like the second part of your suggestion, thanks for that.

Mohit

On Fri, 26 Feb 2021 at 08:15, m.gufranpathan <[email protected]> wrote:

> Hi,
>
> I'm assuming you know the "True" value of the company's classification
> (i.e. you know whether the company is "valid" or "invalid").
>
> If that's the case, then you could calculate accuracy as the number of
> correct classifications divided by the total number of classifications.
> There are two other metrics commonly used - Precision and Recall. You could
> use them if the cost of incorrectly predicting a "Valid" company is higher
> (or lower) than that of incorrectly predicting an "Invalid" company.
>
> Happy to help if you have more questions / clarifications.
>
> Sent from my Samsung Galaxy smartphone.
>
> -------- Original message --------
> From: Mohit K <[email protected]>
> Date: 2/21/21 4:01 PM (GMT+04:00)
> To: [email protected]
> Subject: [datameet] Comparing 2 Classification Algos
>
> Hi All,
>
> I am trying to evaluate 2 algos that classify companies based on certain
> criteria. I need to compare which one is doing a better job at
> classification. The data file can be found at the G Drive link below.
> Results of both algos are in columns J & K.
>
> https://drive.google.com/file/d/1ZFtknedWZANrQQVgVqYxGMtCkhOvJ8hK/view?usp=sharing
>
> Could anyone from a data analytics background help me with how to
> approach this?
>
> Thanks,
> Mohit
>
> Some of the details are given below:
>
> The file contains data of companies. Each row is a company in your
> company database. As you are aware, there are a lot of duplicated
> companies, so your database is marking them as “Invalid” in Column J
> “Flag” in the old algorithm. The new algorithm has these “Flags” listed in
> Column K. So when it says Valid, that’s a company that is determined by
> the algorithm to be a good company + real company + not duplicated, to be
> kept in the database.
>
> There is additional data in the file for each of the companies to help
> you evaluate the companies.
>
> Some issues:
>
> 1) Some companies have many legitimate subsidiaries. For example, Google
> and YouTube might be 2 companies, but YouTube is a subsidiary of Google.
> What you have decided to do is to keep these in your database as 2
> separate companies if these 3 conditions are met:
> a) the subsidiary is large, with >$100M revenue,
> b) the name of that company looks substantially different from the parent,
> and
> c) the identity of the subsidiary still exists, because sometimes the
> parent company just absorbs the subsidiary and the subsidiary disappears,
> i.e. its website no longer exists.
> In the Google / YouTube example, all three conditions are met, so both
> Google and YouTube are kept as different companies in your database.
>
> 2) Many big companies have hundreds of subsidiaries that are all pretty
> much the same company. For example, Citibank can have many subsidiaries
> like Citibank Auto Loans, Citibank New York, Citibank Florida, and those
> typically look like the same company to most consumers, so you do not want
> to keep all those subsidiaries, just the main parent company.
>
> 3) Sometimes we have multiple listings of exactly the same company in our
> database, i.e. the company name is the same and the URL is the same. In
> those cases, you want to keep the company listing in the database that has
> the most information (e.g. revenue, employee #, etc.), the highest
> revenue, etc., and remove the ones with less.
>
> 4) There is often wrong/incorrect information, so of course you want
> to keep the database listing with the most accurate information.
>
> --
> Datameet is a community of Data Science enthusiasts in India.
> Know more about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/datameet/CAJk6f4AftY473Q2ohbGzdPekhm7iybe349F69Xcqy_ZBrO3gTw%40mail.gmail.com
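
For reference, the metrics mentioned in the thread (accuracy, precision, recall) could be computed roughly as in the sketch below. It is only an illustration under stated assumptions: the flags are taken to be the strings "Valid" / "Invalid", "Valid" is treated as the positive class, and the `truth` column is a hypothetical hand-labelled sample, since no true values exist in the shared file. The `disagreements` helper is an added idea, not something proposed in the thread: it lists the rows where the old and new flags differ, which may be a reasonable place to start labelling by hand. All example rows are made up.

```python
from collections import Counter

POSITIVE = "Valid"  # label treated as the positive class (an assumption)


def confusion_counts(truth, predicted, positive=POSITIVE):
    """Count true/false positives and negatives for one algorithm."""
    counts = Counter()
    for t, p in zip(truth, predicted):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


def metrics(truth, predicted, positive=POSITIVE):
    """Return accuracy, precision and recall as a dict."""
    c = confusion_counts(truth, predicted, positive)
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        # correct classifications / all classifications
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        # of the rows flagged Valid, how many really are Valid
        "precision": c["tp"] / predicted_pos if predicted_pos else 0.0,
        # of the truly Valid rows, how many were flagged Valid
        "recall": c["tp"] / actual_pos if actual_pos else 0.0,
    }


def disagreements(old_flags, new_flags):
    """Row indices where the two algorithms give different flags."""
    return [i for i, (a, b) in enumerate(zip(old_flags, new_flags)) if a != b]


# Made-up example rows, not taken from the shared file.
truth     = ["Valid", "Invalid", "Valid", "Invalid", "Valid"]    # hypothetical hand labels
old_flags = ["Valid", "Valid",   "Valid", "Invalid", "Invalid"]  # stands in for Column J
new_flags = ["Valid", "Invalid", "Valid", "Valid",   "Valid"]    # stands in for Column K

old_metrics = metrics(truth, old_flags)
new_metrics = metrics(truth, new_flags)
rows_to_review = disagreements(old_flags, new_flags)
print(old_metrics)
print(new_metrics)
print(rows_to_review)
```

Which metric matters more depends on the relative cost of wrongly keeping a duplicate versus wrongly removing a real company, as noted in the reply above.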
