Julie, I think this web browsing prediction is a great project idea. It's inherently spatial and temporal in its patterns. I'd love to see what you can come up with, and what kinds of predictions can be made by the CLA over large datasets.
Out of curiosity, what datasets are you planning to train on?

Regarding a URL encoder, my gut instinct is that an encoder that operates similarly to the Random Distributed Scalar Encoder might work well. (For details on the RDSE, see this presentation: [1].) We'll have to adapt it to work with URLs as you described. I haven't yet worked out the exact algorithm, but I imagine it would look something like this:

1. Start with the domain. If it is new, generate a random SDR for it and remember the SDR. If it has been seen before, look up the remembered SDR.

2. Look at the subdomain. Select a subset of the bits from the SDR above, and replace them with new ones to represent the subdomain. If the subdomain is new, the new bits will be randomly generated and remembered. If the subdomain has been seen before, look up the remembered bits and use those for the replacement. (The number of bits in the subset will depend on how much overlap you want between similar URLs.)

3. Repeat #2 with the next path element in the hierarchy, and continue until you have gone through the entire URL.

This way, URLs that share path elements as you move down the hierarchy will have shared bits in their SDRs. The method is also flexible enough to support any number of path elements (though if you have too many, you'll saturate the number of unique representations available for unique URLs).

Hope that makes sense, and that it works decently. You'll probably have to modify the algorithm a bit to make it work well; this is all just off the top of my head.

- Chetan

[1] https://www.youtube.com/watch?v=_q5W2Ov6C9E

On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:

> I am tinkering with the CLA a bit and want to play around with web
> browsing history data.
>
> I'm trying to determine whether it would be feasible to predict the URL,
> or at least the top-level domain that is most likely to be visited next by
> a web surfer, based on their past browsing history.
> I might go so far as to make a multi-step prediction to short-circuit a
> web surfer's navigation directly to the page they are interested in.
>
> First of all, I'm looking for feedback on whether this idea even makes
> sense as an application of the CLA, and whether anyone has tried something
> similar.
>
> Second, I'm a little bit stuck coming up with a good way to encode a URL
> for input to the SP. One thought is to break the URL into component fields
> (e.g., top-level domain, URL path and params). The problem is that the
> encoding should be adaptive and pick up values that have never been seen
> before. I'm uncertain how to approach this.
>
> Since there's no semantic similarity to be inferred between two different
> TLDs with similar names, a basic numeric encoding doesn't make sense.
>
> It might be reasonable to think that different URL paths with the same TLD
> and subdomain have some semantic similarity (e.g., maps.google.com/usa
> and maps.google.com/canada are both maps). I would also suggest that if
> two URLs share some path elements, they are even more similar. So ideally,
> I would come up with an encoding that has little or no overlap for
> different TLDs, more overlap for the same TLD and subdomain, and even more
> if they have the same TLD and subdomain and share path elements.
>
> Thoughts?
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
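P.S. The replace-and-remember scheme described above could be sketched in Python roughly as follows. Everything here is an illustrative assumption on my part, not NuPIC code: the class name, the SDR size (n=400, w=21 active bits), swapping 7 bits per path element, and the simplification of treating the last two host labels as the registered domain.

```python
import random
from urllib.parse import urlparse

class URLEncoder:
    """Hierarchical URL encoder sketch: a remembered random SDR per domain,
    with a remembered subset of bits swapped for each deeper URL element."""

    def __init__(self, n=400, w=21, replace=7, seed=42):
        self.n = n              # total bits in the SDR
        self.w = w              # number of active bits
        self.replace = replace  # bits swapped per subdomain/path element
        self.rng = random.Random(seed)
        self.domain_sdrs = {}   # domain -> set of active bit indices
        self.element_bits = {}  # (parent chain, element) -> (old bits, new bits)

    def encode(self, url):
        parsed = urlparse(url)
        host_parts = parsed.netloc.split('.')
        # Simplification: last two host labels are "the domain".
        domain = '.'.join(host_parts[-2:])
        subdomains = host_parts[:-2]
        path_elements = [p for p in parsed.path.split('/') if p]

        # Step 1: random SDR for the domain, remembered across calls.
        if domain not in self.domain_sdrs:
            self.domain_sdrs[domain] = set(
                self.rng.sample(range(self.n), self.w))
        active = set(self.domain_sdrs[domain])

        # Steps 2-3: for each deeper element, swap a remembered subset
        # of bits. The key is the full chain, so the active set reached
        # at each key is deterministic and the swap is reproducible.
        key = domain
        for element in subdomains[::-1] + path_elements:
            key = (key, element)
            if key not in self.element_bits:
                old = self.rng.sample(sorted(active), self.replace)
                new = self.rng.sample(
                    sorted(set(range(self.n)) - active), self.replace)
                self.element_bits[key] = (old, new)
            old, new = self.element_bits[key]
            active.difference_update(old)
            active.update(new)
        return active
```

With this sketch, maps.google.com/usa and maps.google.com/canada share the domain and subdomain bits and differ only in the final 7-bit swap, while an unrelated domain gets an independent random SDR with only chance-level overlap.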
