Hi Matt,

Great idea, but it would definitely be best to also split the path into separate fields, as the paths form a semantic hierarchy (/sport/* would all share meaning, /news/* would be another, with /news/local/*, /news/us/*, /news/eu/* etc. being subhierarchies).
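
A minimal sketch of that split, combining the protocol/domain/path/query/hash breakout quoted below with one field per level of the path hierarchy. It assumes Python 3's urllib.parse; the function name url_to_fields, the max_depth parameter and the path1..pathN field names are made up for illustration and are not part of NuPIC.

```python
from urllib.parse import urlsplit


def url_to_fields(url, max_depth=3):
    """Split a URL into protocol, domain, query, hash and cumulative path levels.

    Hypothetical helper: the field names are illustrative only.
    """
    parts = urlsplit(url)
    # Drop empty segments caused by leading, trailing or doubled slashes.
    segments = [s for s in parts.path.split("/") if s]

    fields = {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "query": parts.query,
        "hash": parts.fragment,
    }

    # path1 = "/news", path2 = "/news/local", and so on. URLs under the same
    # branch of the hierarchy share identical strings in the shallower fields.
    for depth in range(1, max_depth + 1):
        if depth <= len(segments):
            fields["path%d" % depth] = "/" + "/".join(segments[:depth])
        else:
            fields["path%d" % depth] = ""  # URL is shallower than this level
    return fields


if __name__ == "__main__":
    example = "https://www.example.com/news/local/story.html?param1=1&param2=2#target"
    for name, value in sorted(url_to_fields(example).items()):
        print(name, "=", value)
```

Each of these fields could then go into its own CSV column with the "string" field type, as Subutai describes below. With that scheme, any shared meaning between /news/local/* and /news/us/* would come only from the identical "/news" value in the path1 field, since each distinct string still gets its own random representation.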
Marek might be able to help with his VectorEncoder?

Regards,

Fergal Byrne

On Tue, Apr 8, 2014 at 5:39 PM, Matthew Taylor <[email protected]> wrote:

> I was thinking, if you're going to break a URL into different fields, you
> might break it up into "protocol", "domain", "path", "query", and "hash":
>
> https://www.example.com/path/to/resource.html?param1=1&param2=2#target
>
> Would be:
> protocol: https
> domain: www.example.com
> path: path/to/resource.html
> query: param1=1&param2=2
> hash: target
>
> I'm not sure if that will help, but it is a typical URL breakout.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Tue, Apr 8, 2014 at 8:35 AM, Subutai Ahmad <[email protected]> wrote:
>
>> Hi Julie,
>>
>> Just to add to everyone else's input, this is a great application area
>> for CLAs. I did some similar work a couple of years ago and got pretty
>> good results.
>>
>> In terms of encoders, the simplest is to just use the OPF and use the
>> "string" field type instead of float. Every new string that is encountered
>> will automatically get a new random representation. With this scheme each
>> new string will be treated as a completely unique token with no semantic
>> similarity to other URLs. You'll want to make sure the string doesn't
>> contain extraneous stuff since any difference will lead to a new
>> representation.
>>
>> You could break each URL into multiple fields as you suggested. Just make
>> each one a separate CSV field and each field into a string type. I think
>> this will achieve an effect that is similar to Chetan's suggestion. In my
>> experiment each URL represented a news article and had a natural "topic"
>> associated with it such as "business" or "politics" so I had a "topic"
>> field.
>>
>> For best results I would recommend starting with a smaller dataset with a
>> relatively small number of unique strings and then working your way up
>> from there. The amount of data you need to get good results will grow fast
>> as the number of unique strings increases. You'll probably want to swarm
>> on the dataset as the parameters may need to be quite different from the
>> default hotgym parameters.
>>
>> I'm curious to see how this goes. Please send along your results and
>> questions as you make progress!
>>
>> --Subutai
>>
>>
>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>
>>> I am tinkering with the CLA a bit and want to play around with web
>>> browsing history data.
>>>
>>> I'm trying to determine whether it would be feasible to predict the URL,
>>> or at least the top-level domain that is most likely to be visited next
>>> by a web surfer, based on their past browsing history. I might go so far
>>> as to make a multi-step prediction to short-circuit the navigation of a
>>> web surfer directly to the page they are interested in.
>>>
>>> First of all, I'm looking for feedback on whether this idea even makes
>>> sense as an application of the CLA, and whether anyone has tried
>>> something similar.
>>>
>>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>>> for input to the SP. One thought is to break the URL into component
>>> fields (e.g., top-level domain, URL path and params). The problem is that
>>> the encoding should be adaptive and pick up values that have never been
>>> seen before. I'm uncertain how to approach this.
>>>
>>> Since there's no semantic similarity to be inferred between two
>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>> sense.
>>>
>>> It might be reasonable to think that different URL paths with the same
>>> TLD and subdomain have some semantic similarity (e.g.,
>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>> also suggest that if two URLs share some path elements, they are even
>>> more similar. So ideally, I would come up with an encoding that has
>>> little or no overlap for different TLDs, more overlap with the same TLD
>>> and subdomain, and even more if they have the same TLD, subdomain and
>>> share path elements.
>>>
>>> Thoughts?
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>
>>
>

--
Fergal Byrne, Brenter IT

Author, Real Machine Intelligence with Clortex and NuPIC
https://leanpub.com/realsmartmachines

<http://www.examsupport.ie>http://inbits.com - Better Living through Thoughtful Technology
http://ie.linkedin.com/in/fergbyrne/
https://github.com/fergalbyrne

e:[email protected] t:+353 83 4214179
Formerly of Adnet [email protected] http://www.adnet.ie
_______________________________________________ nupic mailing list [email protected] http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
