Let me start by saying thank you for taking the time to help; it is very much appreciated.

>It appears that you want to predict a continuous variable rather than a nominal level one so I don't see it as a classification problem.

Ideally yes, but I'd be happy to use some sort of discretization process to convert the sales conversion to a class (i.e., low, medium, and high).

>Do you have a limited set of words that applies to every case? "top" "10" . . . ? with values yes and no or yes/no/does not apply?

The set of words would continue to grow as more articles are analyzed.

>How many cases (entities, records, lines) do you have in your data set?

About 50,000.

>How many variables (attribute fields, columns) do you want to use as predictors? Do you have the one nominal level predictor (publication) and 6 dichotomous predictors only?

Based on my research, it seemed like converting each word into an attribute was the way to go (akin to a customer's shopping cart, which holds only a couple of the store's many products).

>How many different values does the variable "publication" have?

About 1,000.

>What does "sales conversion" rate mean?

For each article I know how many leads were produced and how many ended in sales. The "sales conversion" is simply the ratio of sales to leads (higher is better).

>Why do you think having these words in the article field would be predictive of sales conversion?

Currently we simply use a list of 200 keywords to decide which articles to use. I'm pretty sure that this list can be improved. A good example is the word 'doctor'. The products being sold are plaques (the kind that doctors love to hang on their walls). I feel pretty sure that there are other patterns in the data waiting to be pulled out.

----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
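
P.S. In case it helps make the setup concrete, here is a rough Python sketch of the data preparation I have in mind. Everything in it is an assumption for illustration: the four sample records are made up, the helper names (conversion_rate, make_binner, word_set) are hypothetical, and the tercile cut points are just one possible discretization, not something we have settled on. Each article keeps only the words it actually contains, like the shopping cart analogy above.

```python
from collections import Counter

# Made-up sample records: (publication, article text, leads, sales).
articles = [
    ("Daily Herald", "Top 10 doctors in the county", 40, 8),
    ("City Weekly", "Local bakery wins award", 25, 1),
    ("Metro News", "Doctor named physician of the year", 30, 6),
    ("Daily Herald", "High school team wins title", 20, 2),
]


def conversion_rate(leads, sales):
    """Sales divided by leads; 0.0 when an article produced no leads."""
    return sales / leads if leads else 0.0


def make_binner(rates):
    """Return a function that maps a rate to low/medium/high.

    Cut points are taken at roughly the 1/3 and 2/3 positions of the
    sorted rates (an assumed, simple tercile scheme).
    """
    ordered = sorted(rates)
    n = len(ordered)
    low_cut = ordered[n // 3]
    high_cut = ordered[(2 * n) // 3]

    def label(rate):
        if rate < low_cut:
            return "low"
        if rate < high_cut:
            return "medium"
        return "high"

    return label


def word_set(text):
    """Sparse 'shopping cart' view: only the words present in the article."""
    return set(text.lower().split())


rates = [conversion_rate(leads, sales) for _, _, leads, sales in articles]
label = make_binner(rates)
labels = [label(r) for r in rates]

# One row per article: nominal predictor, sparse word attributes, class.
rows = [
    {"publication": pub, "words": word_set(text), "class": cls}
    for (pub, text, _, _), cls in zip(articles, labels)
]

# Count how often each word shows up in high-converting articles.
high_words = Counter(
    w for row in rows if row["class"] == "high" for w in row["words"]
)

for row in rows:
    print(row["publication"], row["class"], sorted(row["words"]))
```

Note that this naive split treats "doctor" and "doctors" as different attributes, so some stemming or normalization would probably be needed before mining the real 50,000 articles.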
