Hi,
I want to discover the 10 most popular routes through our website that lead a visitor to register with us. I am already logging page view data, but I haven't been able to find a good way to query it. (Each Visitor has an ID, each Visitor makes multiple Visits, each with an ID, and each page request has an ID. I can join across all of these fields.)

The site is quite high traffic, more than 3 million page views per month, so the solution needs to scale. That said, I expect that for the next year or so the dataset will fit on a single machine. My research so far suggests that Hadoop is generally only worthwhile when the dataset is large enough to warrant running at least a couple of Hadoop nodes.

Is this problem even suited to Hadoop? How might I go about solving it?

My initial thought was to use a graph database and traverse outwards from the registration page node, at each step choosing the neighbouring node that has the most links between itself and the registration node. I'm running into difficulties with this approach, and wondered whether Hadoop might offer an alternative.

Any pointers would be greatly appreciated.

Thanks,
Tim
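
P.S. To make the question concrete, here is roughly what I mean by a "route", as a minimal single-machine sketch in Python. It is not what I am running today. It assumes each page view can be flattened to a (visit_id, request_id, page) tuple, that request IDs increase in the order the pages were requested within a visit, and that registration is a single page at a path like "/register". The names top_routes and REGISTER_PAGE are made up for illustration.

```python
from collections import Counter, defaultdict

REGISTER_PAGE = "/register"  # hypothetical path of the registration page


def top_routes(page_views, n=10):
    """Return the n most common page sequences that end in a registration.

    page_views: iterable of (visit_id, request_id, page) tuples.
    Assumes request_id increases with time within a visit.
    """
    # Group page requests by visit.
    visits = defaultdict(list)
    for visit_id, request_id, page in page_views:
        visits[visit_id].append((request_id, page))

    counts = Counter()
    for requests in visits.values():
        # Order the pages of this visit by request ID.
        pages = [page for _, page in sorted(requests)]
        if REGISTER_PAGE in pages:
            # Keep the pages up to and including the first registration hit.
            route = tuple(pages[: pages.index(REGISTER_PAGE) + 1])
            counts[route] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        (1, 1, "/"), (1, 2, "/pricing"), (1, 3, "/register"),
        (2, 4, "/"), (2, 5, "/pricing"), (2, 6, "/register"),
        (3, 7, "/"), (3, 8, "/register"),
        (4, 9, "/"), (4, 10, "/about"),  # this visit never registers
    ]
    for route, count in top_routes(sample):
        print(count, " -> ".join(route))
```

This counts whole routes per visit rather than walking a graph, and it holds every visit in memory, so I don't know how far it would stretch. As far as I can tell the same group-by-visit, count-by-route shape could be expressed as a map/reduce job, which is partly why I am asking about Hadoop.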