Hi,

I'm assuming you know the true classification of each company (i.e. you know whether each company is actually "Valid" or "Invalid"). If that's the case, you can calculate accuracy for each algorithm as the number of correct classifications divided by the total number of classifications. Two other metrics are commonly used: precision and recall. Precision is the share of companies flagged "Valid" that really are valid; recall is the share of truly valid companies that the algorithm flags "Valid". They are worth using if the cost of incorrectly predicting a "Valid" company is higher (or lower) than the cost of incorrectly predicting an "Invalid" company.

Happy to help if you have more questions or need clarification.
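A minimal sketch of that calculation, in case it is useful. It assumes you have a hand-checked "true" label for each company (the shared file may not contain one, so that column is an assumption), and it treats "Valid" as the positive class. The function name `score` and the six toy labels are made up for illustration and are not taken from the data file.

```python
from typing import Dict, Sequence


def score(truth: Sequence[str], predicted: Sequence[str],
          positive: str = "Valid") -> Dict[str, float]:
    """Accuracy, precision and recall of `predicted` against `truth`."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("need at least one labelled company")

    pairs = list(zip(truth, predicted))
    correct = sum(t == p for t, p in pairs)
    # True positives: really valid and flagged valid.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False positives: really invalid but flagged valid.
    fp = sum(t != positive and p == positive for t, p in pairs)
    # False negatives: really valid but flagged invalid.
    fn = sum(t == positive and p != positive for t, p in pairs)

    return {
        "accuracy": correct / len(pairs),
        # Undefined (NaN) if the algorithm never predicts the positive class.
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # Undefined (NaN) if no company is truly in the positive class.
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Toy labels, invented for illustration only.
truth    = ["Valid", "Invalid", "Valid", "Invalid", "Valid",   "Invalid"]
old_algo = ["Valid", "Valid",   "Valid", "Invalid", "Invalid", "Invalid"]  # like Column J
new_algo = ["Valid", "Invalid", "Valid", "Invalid", "Valid",   "Valid"]    # like Column K

old_scores = score(truth, old_algo)
new_scores = score(truth, new_algo)
print("old:", old_scores)
print("new:", new_scores)
```

If wrongly keeping a duplicate is the costlier mistake, compare the two algorithms mainly on precision for "Valid"; if wrongly deleting a real company is costlier, compare them mainly on recall.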
-------- Original message --------
From: Mohit K <[email protected]>
Date: 2/21/21 4:01 PM (GMT+04:00)
To: [email protected]
Subject: [datameet] Comparing 2 Classification Algos

Hi All,

I am trying to evaluate 2 algos that classify companies based on certain criteria. I need to compare which one is doing a better job at classification. The data file can be found at the G drive link below. Results of both algos are in columns J & K.

https://drive.google.com/file/d/1ZFtknedWZANrQQVgVqYxGMtCkhOvJ8hK/view?usp=sharing

Could anyone from a data analytics background help me with how to approach this?

Thanks,
Mohit

Some of the details are given below:

The file contains data of companies. Each row is a company in your company database. As you are aware, there are a lot of duplicated companies, so your database marks them as “Invalid” in Column J, “Flag”, under the old algorithm. The new algorithm has these “Flags” listed in Column K. So when it says Valid, that’s a company that is determined by the algorithm to be a good company + real company + not duplicated, to be kept in the database. There is additional data in the file for each of the companies to help you evaluate them.

Some issues:

1) Some companies have many legitimate subsidiaries. For example, Google and YouTube might be 2 companies, but YouTube is a subsidiary of Google. What you have decided to do is keep these in your database as 2 separate companies if these 3 conditions are met: a) the subsidiary is large, with >$100M revenue, b) the name of that company looks substantially different from the parent, and c) the identity of the subsidiary still exists, because sometimes the parent company just absorbs the subsidiary and the subsidiary disappears, i.e. its website no longer exists. In the Google / YouTube example, all three conditions are met, so both Google and YouTube are kept as different companies in your database.

2) Many big companies have hundreds of subsidiaries that are all pretty much the same company. For example, Citibank can have many subsidiaries like Citibank Auto Loans, Citibank New York, Citibank Florida, and those typically look like the same company to most consumers, so you do not want to keep all those subsidiaries but only the main parent company.

3) Sometimes the database has multiple listings that are exactly the same company, i.e. the company name is the same and the URL is the same. In those cases, you want to keep the company listing in the database that has the most information (e.g. revenue, employee #, etc.), the highest revenue, etc., and remove the ones with less.

4) There is often wrong/incorrect information, so of course you want to keep the database listing with the most accurate information.

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/datameet/CAJk6f4AftY473Q2ohbGzdPekhm7iybe349F69Xcqy_ZBrO3gTw%40mail.gmail.com.
