Hi All,

I am trying to evaluate two algorithms that classify companies based on
certain criteria, and I need to work out which one does the better job of
classification. The data file is at the Google Drive link below. The
results of the two algorithms are in columns J and K.

https://drive.google.com/file/d/1ZFtknedWZANrQQVgVqYxGMtCkhOvJ8hK/view?usp=sharing

Could anyone with a data analytics background suggest how to approach
this?

Thanks,
Mohit

Some details are given below:

The file contains company data.  Each row is a company in your company
database.  As you are aware, there are a lot of duplicated companies, so
the old algorithm marks them as “Invalid” in Column J (“Flag”).  The new
algorithm's flags are listed in Column K.  A company flagged “Valid” is one
that the algorithm has determined to be a good, real, non-duplicated
company that should be kept in the database.

The file also contains additional data for each company to help you
evaluate it.
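
One way to start, as a minimal sketch only: cross-tabulate the two flag
columns, pull out the rows where the algorithms disagree, hand-label those
rows, and see which algorithm matches the hand labels more often. The field
names old_flag, new_flag and hand_label are hypothetical stand-ins for
Column J, Column K and a manually added column, and the rows are made-up toy
data, not taken from the file.

```python
from collections import Counter

def cross_tab(rows):
    """Count how often each (old_flag, new_flag) pair occurs."""
    return Counter((r["old_flag"], r["new_flag"]) for r in rows)

def disagreements(rows):
    """Rows where the two algorithms give different flags."""
    return [r for r in rows if r["old_flag"] != r["new_flag"]]

def agreement_with_labels(rows, flag_key, label_key="hand_label"):
    """Share of hand-labeled rows where the given flag matches the label."""
    labeled = [r for r in rows if r.get(label_key)]
    if not labeled:
        return None
    hits = sum(1 for r in labeled if r[flag_key] == r[label_key])
    return hits / len(labeled)

# Toy rows invented for illustration; hand_label is filled in only for
# the rows where the two algorithms disagree.
rows = [
    {"name": "A", "old_flag": "Valid", "new_flag": "Valid"},
    {"name": "B", "old_flag": "Invalid", "new_flag": "Valid", "hand_label": "Valid"},
    {"name": "C", "old_flag": "Invalid", "new_flag": "Invalid"},
    {"name": "D", "old_flag": "Valid", "new_flag": "Invalid", "hand_label": "Invalid"},
    {"name": "E", "old_flag": "Invalid", "new_flag": "Invalid"},
    {"name": "F", "old_flag": "Valid", "new_flag": "Invalid", "hand_label": "Valid"},
]

table = cross_tab(rows)
to_review = disagreements(rows)
old_score = agreement_with_labels(rows, "old_flag")
new_score = agreement_with_labels(rows, "new_flag")
```

Rows where both algorithms agree cannot separate them, so the manual effort
goes into the disagreements. A hand-checked random sample of the agreeing
rows would still be needed to estimate errors that both algorithms share.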

Some issues:

1) Some companies have many legitimate subsidiaries.  For example, Google
and YouTube might be 2 companies, but YouTube is a subsidiary of Google.
You have decided to keep such a pair in your database as 2 separate
companies if all 3 of these conditions are met:
a) the subsidiary is large, with more than $100M in revenue,
b) the subsidiary's name looks substantially different from the parent's,
and
c) the subsidiary still exists as a distinct identity.  Sometimes the
parent company simply absorbs the subsidiary and the subsidiary
disappears, i.e. its website no longer exists.
In the Google / YouTube example, all three conditions are met, so both
Google and YouTube are kept as different companies in your database.

2) Many big companies have hundreds of subsidiaries that are all pretty
much the same company.  For example, Citibank can have many subsidiaries
like Citibank Auto Loans, Citibank New York, and Citibank Florida.  Those
typically look like the same company to most consumers, so you do not want
to keep all those subsidiaries, only the main parent company.

3) Sometimes the database contains multiple listings that are exactly the
same company, i.e. the company name is the same and the URL is the same.
In those cases, you want to keep the listing that has the most information
(e.g. revenue, employee #, etc.) and the highest revenue, and remove the
ones with less.

4) Listings often contain incorrect information, so of course you want to
keep the database listing with the most accurate information.
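
To hand-label rows consistently, rules 1 to 3 above can be written down as
small checks. This is a rough sketch under stated assumptions, not how
either algorithm works: the 0.6 name-similarity cutoff, the substring test,
the record field names and the example revenue figures are all invented for
illustration. Only the $100M threshold comes from condition 1a.

```python
from difflib import SequenceMatcher

REVENUE_THRESHOLD_USD = 100_000_000  # condition 1a
NAME_SIMILARITY_CUTOFF = 0.6         # assumption, tune on real data

def names_differ(sub_name, parent_name):
    """Condition 1b: the subsidiary name looks substantially different."""
    sub, parent = sub_name.lower(), parent_name.lower()
    if parent in sub:  # e.g. "Citibank Florida" contains "Citibank"
        return False
    return SequenceMatcher(None, sub, parent).ratio() < NAME_SIMILARITY_CUTOFF

def keep_subsidiary(sub_name, parent_name, revenue_usd, site_is_live):
    """Rule 1: keep the subsidiary only if all three conditions hold."""
    return (revenue_usd > REVENUE_THRESHOLD_USD
            and names_differ(sub_name, parent_name)
            and site_is_live)

def best_duplicate(records):
    """Rule 3: among exact duplicates, prefer the listing with the most
    filled-in fields, breaking ties by higher revenue."""
    def key(rec):
        filled = sum(1 for v in rec.values() if v not in (None, ""))
        return (filled, rec.get("revenue") or 0)
    return max(records, key=key)

# Hypothetical inputs; the revenue figures are placeholders, not real data.
youtube_kept = keep_subsidiary("YouTube", "Google", 150_000_000, True)
citi_kept = keep_subsidiary("Citibank Auto Loans", "Citibank", 150_000_000, True)

duplicates = [
    {"name": "Acme", "url": "acme.example", "revenue": None, "employees": None},
    {"name": "Acme", "url": "acme.example", "revenue": 5_000_000, "employees": 40},
    {"name": "Acme", "url": "acme.example", "revenue": 7_000_000, "employees": None},
]
kept = best_duplicate(duplicates)
```

Rule 4 (accuracy of the information) is not covered here, since it needs an
outside source to check each listing against.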
