There are many tools you could apply to this problem. Since I know
recommenders, I can tell you how the Mahout recommender might apply.
But this is just one approach.

It is fairly easy to map your input and output onto a normal
recommender problem; the challenge is scale. I would frame it like this:

- Your 'items' to recommend are advertisements
- Ignore ad impressions; they are not going to be useful data points
- Use ad clicks; a click establishes a connection between a user and an ad
- Since you typically have either a click or nothing at all, I would
not attempt to create some kind of 'rating' or preference value
between the user and the ad. You simply have an association, or you
don't. In Mahout CF, this means using the 'Boolean' data model,
algorithms, etc. (there's a small sketch after this list); we can
discuss this more later
- To start, I'd ignore categories and demographics
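
To make the Boolean model concrete, here's a rough sketch with the
Taste classes. The file name and format are my assumptions: a
comma-separated file of "userID,adID" lines with no preference value,
which FileDataModel treats as Boolean associations.

  import java.io.File;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.model.DataModel;

  public class BooleanModelSketch {
    public static void main(String[] args) throws Exception {
      // clicks.csv (hypothetical): one "userID,adID" line per click.
      // With no preference-value field, each line is just an
      // association that exists, or doesn't -- the 'Boolean' model.
      DataModel model = new FileDataModel(new File("clicks.csv"));
      System.out.println(model.getNumUsers() + " users, "
          + model.getNumItems() + " ads");
    }
  }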

The only things that make this non-trivial are:

1. Your 'items' are very transient. While you probably want to use a
lot of historical click data to compute similarities between users,
you never recommend an ad that isn't part of a currently running
campaign. To address this:

- Use a user-based recommender
- *Precompute* user-user similarities based on a similarity metric
like LogLikelihoodSimilarity; don't do this at runtime, it'll be too
slow. Use all available data to perform these computations
- At runtime, create a data model which contains only current ads as
items, and feed it these precomputed similarities (sketched below)
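
Roughly, that runtime wiring could look like the following. The file
name is made up, and I'm hard-coding one similarity value as a
stand-in for whatever you actually load from your precomputed store:

  import java.io.File;
  import java.util.Arrays;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.GenericUserSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class CurrentAdsRecommenderSketch {
    public static void main(String[] args) throws Exception {
      // current-ad-clicks.csv (hypothetical): "userID,adID" lines for
      // ads in currently running campaigns only.
      DataModel currentAds =
          new FileDataModel(new File("current-ad-clicks.csv"));

      // Stand-in for the user-user similarities precomputed offline
      // over all historical data; in reality, load them all here.
      UserSimilarity precomputed = new GenericUserSimilarity(Arrays.asList(
          new GenericUserSimilarity.UserUserSimilarity(12345L, 67890L, 0.8)));

      UserNeighborhood neighborhood =
          new NearestNUserNeighborhood(50, precomputed, currentAds);
      Recommender recommender = new GenericBooleanPrefUserBasedRecommender(
          currentAds, neighborhood, precomputed);

      // Top 3 currently running ads for user 12345.
      List<RecommendedItem> ads = recommender.recommend(12345L, 3);
      for (RecommendedItem ad : ads) {
        System.out.println(ad.getItemID() + " " + ad.getValue());
      }
    }
  }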

2. Large scale

- I would heavily filter your data. Obviously, don't consider users
with no clicks, few clicks, or who aren't active. I find that,
typically, most data available to a recommender computation is noise;
it doesn't help the results. So achieving scalability usually involves
knowing what little data to bother keeping.
- Because you pretty much need to precompute all those user-user
similarities, and they don't necessarily change often, this is
something you can parallelize using Hadoop. Mahout does not offer a
pre-built job to compute user-user similarities, but you can easily
create a job that calls Mahout classes to do this (see the sketch
after this list)
- Unfortunately, I don't think much more than this can be done
offline on distributed systems. You have to produce very fresh
recommendations in this system, so you can't simply compute them
every night, or even every hour, I'd guess.
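
To illustrate both the filtering and the similarity computation,
here's a single-threaded sketch; the O(n^2) pair loop is exactly the
part you'd split across Hadoop mappers, each taking a slice of the
user IDs. MIN_CLICKS and the file name are my inventions, and a real
job would write the triples to a database rather than stdout:

  import java.io.File;
  import org.apache.mahout.cf.taste.impl.common.FastIDSet;
  import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class SimilarityPrecomputeSketch {

    private static final int MIN_CLICKS = 3; // assumption: tune this

    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("all-clicks.csv"));
      UserSimilarity loglike = new LogLikelihoodSimilarity(model);

      // Filter: only keep users who have clicked enough ads.
      FastIDSet keep = new FastIDSet();
      LongPrimitiveIterator users = model.getUserIDs();
      while (users.hasNext()) {
        long userID = users.nextLong();
        if (model.getItemIDsFromUser(userID).size() >= MIN_CLICKS) {
          keep.add(userID);
        }
      }

      // All-pairs similarity over the filtered users. Each Hadoop
      // mapper would handle a range of i and emit these triples.
      long[] ids = keep.toArray();
      for (int i = 0; i < ids.length; i++) {
        for (int j = i + 1; j < ids.length; j++) {
          double sim = loglike.userSimilarity(ids[i], ids[j]);
          if (!Double.isNaN(sim)) {
            System.out.println(ids[i] + "," + ids[j] + "," + sim);
          }
        }
      }
    }
  }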

You wanted some more technical detail. Beyond the sketches above, I
won't go into particular code or classes. At a high level, I think
you'd have:

- A job of some kind that periodically reads all your click data,
throws out whatever is probably not useful to this computation, and
stores the rest in HDFS (Hadoop's distributed file system).
- A Hadoop cluster running a nightly job that recomputes all user-user
similarities
- Another job which stores these results in a database that is
available to your online systems
- Some database table which has, at least, the currently running,
active ads and which users have clicked them (that is, not all click
data, just active-ad click data)
- A server application which actually makes recommendations in real
time. I imagine speed will be vital in an ad system, so I would plan
to read all the user-user similarities into memory (see the sketch
below). This will probably need a lot of RAM; you can scale it by
adding more servers.
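
For that last piece, a sketch of loading the precomputed pairs into
memory at startup. The JDBC URL, table, and column names are all
hypothetical; this just shows how the stored triples turn back into
the kind of UserSimilarity used in the earlier sketch:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.similarity.GenericUserSimilarity;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class SimilarityLoaderSketch {
    public static UserSimilarity load(String jdbcUrl) throws Exception {
      List<GenericUserSimilarity.UserUserSimilarity> pairs =
          new ArrayList<GenericUserSimilarity.UserUserSimilarity>();
      Connection c = DriverManager.getConnection(jdbcUrl);
      try {
        Statement s = c.createStatement();
        // user_similarity(user_a, user_b, sim): hypothetical table
        // written by the nightly similarity job.
        ResultSet rs = s.executeQuery(
            "SELECT user_a, user_b, sim FROM user_similarity");
        while (rs.next()) {
          pairs.add(new GenericUserSimilarity.UserUserSimilarity(
              rs.getLong(1), rs.getLong(2), rs.getDouble(3)));
        }
      } finally {
        c.close();
      }
      // Everything lives in RAM from here on; size the heap accordingly.
      return new GenericUserSimilarity(pairs);
    }
  }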

That's roughly it, to get you started.


On Mon, Aug 24, 2009 at 2:32 PM, Benjamin
Dageroth <[email protected]> wrote:
> Hi,
>
> I got the demo up and running and am now trying to figure out how to go
> forward with a few tries on my own to determine whether we can actually use
> Mahout. We are getting a lot of data on many users and would like to use this
> data in order to provide more relevant ads - relevant not only according to
> the content of the site, but to the interests of the user and what he liked
> in the past. So I know, e.g., the type of site he is on (twenty categories),
> the types of sites he has visited in the past, the ads he saw, and the ads he
> clicked, including the category to which each ad belongs.
> Furthermore, I'd like to build a profile of interests and, if I can, gather
> some demographic data for a number of sites - this should enable me to use
> naïve Bayes to deduce gender and age with some probability, depending on the
> recorded history of sites someone visited within the ad network.
>
> All this information I'd like to use in order to recommend which ad to
> deliver, either because similar users clicked it, or because a user clicked
> on an ad which has often been clicked along with another ad (item-based or
> user-based, depending on which one provides the better result). Other
> interesting data points would be time (are there specific times at which ads
> perform well or badly?), location, and the actual combinations of site and ad.
>
> I am not a very good programmer and am working more from the conceptual
> angle, looking for technologies which we could use. So I am not sure how to
> store the data I collect (I created a database schema) to make it available
> to Mahout, as it seems to run on Hadoop and not with a normal database? I am
> still looking for more documentation, so if you could point me to something
> or have some idea how to proceed, I'd appreciate it.
>
> We definitely need something which scales, as the ad network is creating
> billions of ad impressions per month with millions of users, and Mahout
> seemed to be the only suitable thing, although it is still pretty early in
> its development process.
>
>
> Thanks for any pointers and opinions,
>
> Benjamin
>
>
> _______________________________________
> Benjamin Dageroth, Key Account Manager / Softwareentwickler
> Webtrekk GmbH
> Boxhagener Str. 76-78, 10245 Berlin
> fon 030 - 755 415 - 360
> fax 030 - 755 415 - 100
> [email protected]
> http://www.webtrekk.com
> Amtsgericht Berlin, HRB 93435 B
> Geschäftsführer Christian Sauer
>
>
> _______________________________________
>
