I am tinkering with the CLA a bit and want to play around with web browsing history data.
I'm trying to determine whether it would be feasible to predict the URL, or at least the domain (e.g., google.com), that a web surfer is most likely to visit next, based on their past browsing history. I might go so far as to make a multi-step prediction to short-circuit the navigation and take the surfer directly to the page they are interested in.

First of all, I'm looking for feedback on whether this idea even makes sense as an application of the CLA, and whether anyone has tried something similar.

Second, I'm a little bit stuck coming up with a good way to encode a URL for input to the SP. One thought is to break the URL into component fields (e.g., domain, URL path and params). The problem is that the encoding should be adaptive and handle values that have never been seen before, and I'm uncertain how to approach this.

Since there's no semantic similarity to be inferred between two different domains with similar names, a basic numeric encoding doesn't make sense. It might be reasonable, though, to think that different URL paths under the same domain and subdomain have some semantic similarity (e.g., maps.google.com/usa and maps.google.com/canada are both maps). I would also suggest that if two URLs share some path elements, they are even more similar. So ideally, I would come up with an encoding that has little or no overlap for different domains, more overlap for the same domain and subdomain, and even more overlap if the URLs have the same domain and subdomain and also share path elements.

Thoughts?
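To make the overlap idea concrete, here is a rough sketch of one way such an encoding might look. It is plain Python, not a NuPIC encoder, and everything in it is an assumption made for illustration: the names `token_bits`, `url_tokens`, `encode_url` and `overlap`, the width of 2048 bits, the 20 active bits per token, and the naive rule that the last two labels of the host name form the domain (which is wrong for suffixes such as co.uk). Each URL is split into hierarchical tokens (domain, full host, and every path prefix), and each token is hashed to a fixed set of bit positions. Because the bits come from a hash, a value that has never been seen before still gets a stable encoding without any retraining.

```python
import hashlib
from urllib.parse import urlsplit

N_BITS = 2048        # total width of the representation (assumed value)
BITS_PER_TOKEN = 20  # active bits contributed by each token (assumed value)


def token_bits(token, n_bits=N_BITS, w=BITS_PER_TOKEN):
    """Deterministically map a string token to w distinct bit indices."""
    bits = set()
    counter = 0
    while len(bits) < w:
        # Re-hash with a counter until w distinct positions are collected.
        digest = hashlib.sha256(f"{token}#{counter}".encode("utf-8")).digest()
        bits.add(int.from_bytes(digest[:8], "big") % n_bits)
        counter += 1
    return bits


def url_tokens(url):
    """Split a URL into hierarchical tokens: domain, host, path prefixes."""
    # urlsplit only recognises the host if a scheme or leading // is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    # Naive assumption: the last two labels form the domain.
    domain = ".".join(host.split(".")[-2:])
    tokens = ["domain:" + domain, "host:" + host]
    prefix = ""
    for segment in filter(None, parts.path.split("/")):
        prefix += "/" + segment
        # Path tokens include the host, so /usa on two sites does not match.
        tokens.append("path:" + host + prefix)
    return tokens


def encode_url(url):
    """Return the set of active bit indices for a URL."""
    active = set()
    for token in url_tokens(url):
        active |= token_bits(token)
    return active


def overlap(url_a, url_b):
    """Number of active bits the two encodings have in common."""
    return len(encode_url(url_a) & encode_url(url_b))


if __name__ == "__main__":
    base = "maps.google.com/usa/california"
    for other in ("maps.google.com/usa/texas",
                  "maps.google.com/canada",
                  "mail.google.com/inbox",
                  "example.org/usa/california"):
        print(other, overlap(base, other))
```

With this scheme, two URLs on unrelated domains share only the few bits that collide by chance, URLs on the same domain share the domain token's bits, URLs on the same host additionally share the host token's bits, and every common path prefix adds another block of shared bits. One open question with this sketch is that the number of active bits grows with path depth, so a real encoder would probably need to cap or subsample the bits to keep sparsity roughly constant.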
