I was thinking, if you're going to break a URL into different fields, you might split it up into "protocol", "domain", "path", "query", and "hash":
https://www.example.com/path/to/resource.html?param1=1&param2=2#target

Would be:

protocol: https
domain: www.example.com
path: path/to/resource.html
query: param1=1&param2=2
hash: target

I'm not sure if that will help, but it is a typical URL breakout.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Tue, Apr 8, 2014 at 8:35 AM, Subutai Ahmad <[email protected]> wrote:

> Hi Julie,
>
> Just to add to everyone else's input, this is a great application area for
> CLA's. I did some similar work a couple of years ago and got pretty good
> results.
>
> In terms of encoders, the simplest is to just use the OPF and use the
> "string" field type instead of float. Every new string that is encountered
> will automatically get a new random representation. With this scheme each
> new string will be treated as a completely unique token with no semantic
> similarity to other URL's. You'll want to make sure the string doesn't
> contain extraneous stuff since any difference will lead to a new
> representation.
>
> You could break each URL into multiple fields as you suggested. Just make
> each one a separate CSV field and each field into a string type. I think
> this will achieve an effect that is similar to Chetan's suggestion. In my
> experiment each URL represented a news article and had a natural "topic"
> associated with it such as "business" or "politics" so I had a "topic"
> field.
>
> For best results I would recommend starting with a smaller dataset with a
> relatively small number of unique strings and then work your way up from
> there. The amount of data you need to get good results will grow fast as
> the number of unique strings increases. You'll probably want to swarm on
> the dataset as the parameters may need to be quite different from the
> default hotgym parameters.
>
> I'm curious to see how this goes. Please send along your results and
> questions as you make progress!
>
> --Subutai
>
>
> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>
>> I am tinkering with the CLA a bit and want to play around with web
>> browsing history data.
>>
>> I'm trying to determine whether it would be feasible to predict the URL,
>> or at least the top-level domain that is most likely to be visited next by
>> a web surfer, based on their past browsing history. I might go so far as to
>> make a multi-step prediction to short-circuit the navigation of a web
>> surfer to directly the page they are interested in.
>>
>> First of all, I'm looking for feedback on whether this idea even makes
>> sense as an application of the CLA, and whether anyone has tried something
>> similar.
>>
>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>> for input to the SP. One thought is to break the URL into component fields
>> (e.g., top-level domain, URL path and params). The problem is that the
>> encoding should be adaptive and pick up values that have never been seen
>> before. I'm uncertain how to approach this.
>>
>> Since there's no semantic similarity to be inferred between two different
>> TLDs with similar names, a basic numeric encoding doesn't make sense.
>>
>> It might be reasonable to think that different URL paths with the same
>> TLD and subdomain have some semantic similarity (e.g.,
>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>> also suggest that if two URLs share some path elements, they are even more
>> similar. So ideally, I would come up with an encoding that has little or no
>> overlap for different TLDs, more overlap with same TLDs and subdomain, and
>> even more if they have the same TLD, subdomain and share path elements.
>>
>> Thoughts?
>>
>> _______________________________________________
>> nupic mailing list
>> [email protected]
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
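
P.S. In case it is useful, here is a minimal sketch of the breakout at the top of this message, with each part written out as its own CSV column so that each one could be treated as a separate string field. It assumes Python 3's urllib.parse (the same functions live in the urlparse module in Python 2). The names split_url, urls_to_csv and FIELDS are made up for illustration and are not part of NuPIC, and any extra header rows the OPF expects in its CSV input are not shown.

```python
import csv
import io
from urllib.parse import urlsplit

# Column names follow the breakout in this message (hypothetical, not a NuPIC convention).
FIELDS = ["protocol", "domain", "path", "query", "hash"]


def split_url(url):
    """Split a URL into five plain-string fields."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname or "",   # hostname drops any port and lowercases
        "path": parts.path.lstrip("/"),   # drop the leading slash, as in the breakout
        "query": parts.query,
        "hash": parts.fragment,
    }


def urls_to_csv(urls):
    """Return CSV text with one row per URL and one column per field."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for url in urls:
        writer.writerow(split_url(url))
    return out.getvalue()


example = "https://www.example.com/path/to/resource.html?param1=1&param2=2#target"
print(split_url(example))
print(urls_to_csv([example]))
```

Since any difference in a string produces a new representation, it is probably worth normalizing each field (for example, dropping tracking parameters from the query) before writing the row.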
