OK, I think I have my data set up to swarm over. I ended up creating different fields for the TLD, domain, port, subdomain(s) and up to 6 path elements (not sure yet whether that's too many). At some point I thought someone posted either a YouTube video or a GitHub link with an example swarm that uses multiple fields in the SDR and creates a model that predicts those fields. If anyone is aware of such an example, I'd appreciate a pointer!
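(For anyone following along, the multi-field setup described above can be sketched as a swarm description. This is a hypothetical sketch only: the key names follow the NuPIC OPF swarm-description format, but the file name `browsing_history.csv`, the choice of predicted field, and the swarm size are assumptions that would need tuning.)

```python
# Hypothetical sketch of a swarm description over multiple string fields
# parsed out of a URL. Key names follow the NuPIC OPF swarm-description
# format; concrete values here are placeholders.

URL_FIELDS = ["tld", "domain", "port", "subdomain"] + [
    "path%d" % i for i in range(1, 7)  # up to 6 path elements
]

swarm_description = {
    # every URL component becomes its own string-typed field
    "includedFields": [
        {"fieldName": name, "fieldType": "string"} for name in URL_FIELDS
    ],
    "streamDef": {
        "version": 1,
        "streams": [
            {
                "source": "file://browsing_history.csv",  # hypothetical file
                "info": "parsed URL components",
                "columns": ["*"],
            }
        ],
    },
    "inferenceType": "TemporalMultiStep",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": "domain",  # e.g. predict the next domain visited
    },
    "swarmSize": "medium",
}
```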
On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:

> Thanks all for your very helpful suggestions. I'm working on this when I
> have scraps of time, so apologies for my late responses. Subutai/Matt, I
> was thinking the same thing in terms of parsing a URL into its main
> components and encoding each one separately. And yes, there is the problem
> of not knowing how many URL path elements there will be. Perhaps a first
> cut is to just arbitrarily limit the number and throw away the rest. I
> would also throw away URL params initially.
>
> Using a random encoder, there would obviously be no real semantic
> understanding of what these URLs mean, but I'm wondering if some level of
> understanding could be achieved by using multiple sensory regions for
> different parts of the URL and then forming a 2-level hierarchy to identify
> and predict sequences. If I can get anything interesting out of pure URL
> data, I would want to add temporal data to see if any predictions could be
> made in terms of a user's behavior throughout the day, week and year
> (i.e., it's holiday season, so you'll probably be cruising Amazon).
>
> Back to Chetan's question about where I'm getting my data: I am harvesting
> my own browsing history from Chrome's SQLite DB. So, this is really a toy.
> I may conscript a few friends to share their history with me. Any
> volunteers? :-)
>
> As far as training goes, I'm thinking there's value in pooling history
> from many people on which to learn "default" behaviors, and then keeping
> learning turned on to get a feel for how a particular user behaves. The
> problem is how to feed the data in, because at that point you get
> interleaved time steps that are unrelated. There probably needs to be a
> concept of a user in there.
>
> On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:
>
>> Sure, I can look that up. I need to dig around for it.
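(The parsing scheme in the quoted message — split a URL into its main components, cap the number of path elements, throw away params — can be sketched with the standard library. This is an illustrative sketch, not code from the thread; note that the naive "last label is the TLD" split mishandles multi-part TLDs like co.uk.)

```python
from urllib.parse import urlparse

MAX_PATH_ELEMENTS = 6  # arbitrary cap; extra path elements are thrown away


def parse_url(url, max_path=MAX_PATH_ELEMENTS):
    """Split a URL into component fields, discarding query params.

    Naive TLD handling: the last host label is taken as the TLD,
    the second-to-last as the domain, the rest as the subdomain.
    """
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".")
    tld = labels[-1] if len(labels) > 1 else ""
    domain = labels[-2] if len(labels) > 1 else host
    subdomain = ".".join(labels[:-2])  # everything left of the domain
    path_elements = [p for p in parts.path.split("/") if p][:max_path]

    record = {
        "tld": tld,
        "domain": domain,
        "subdomain": subdomain,
        "port": parts.port or "",
    }
    # pad missing path elements with empty strings so every record
    # has the same fields
    for i in range(max_path):
        record["path%d" % (i + 1)] = (
            path_elements[i] if i < len(path_elements) else ""
        )
    return record
```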
>>
>> --Subutai
>>
>> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>>
>>> Subutai, do you think you could still dig up some paper or data from
>>> your previous experiments? It would be interesting!
>>> Cheers,
>>>
>>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>>
>>>> Hi Julie,
>>>>
>>>> Just to add to everyone else's input: this is a great application area
>>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>>> good results.
>>>>
>>>> In terms of encoders, the simplest approach is to use the OPF with the
>>>> "string" field type instead of float. Every new string that is
>>>> encountered automatically gets a new random representation. With this
>>>> scheme each new string is treated as a completely unique token with no
>>>> semantic similarity to other URLs. You'll want to make sure the string
>>>> doesn't contain extraneous content, since any difference will lead to a
>>>> new representation.
>>>>
>>>> You could break each URL into multiple fields as you suggested. Just
>>>> make each one a separate CSV field and give each field the string type.
>>>> I think this will achieve an effect similar to Chetan's suggestion. In
>>>> my experiment each URL represented a news article and had a natural
>>>> "topic" associated with it, such as "business" or "politics", so I had
>>>> a "topic" field.
>>>>
>>>> For best results I would recommend starting with a smaller dataset
>>>> containing a relatively small number of unique strings and working your
>>>> way up from there. The amount of data you need to get good results
>>>> grows fast as the number of unique strings increases. You'll probably
>>>> want to swarm on the dataset, as the parameters may need to be quite
>>>> different from the default hotgym parameters.
>>>>
>>>> I'm curious to see how this goes. Please send along your results and
>>>> questions as you make progress!
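(Subutai's suggestion — one CSV field per URL component, each typed as a string — corresponds to the OPF's three-row CSV header: field names, field types, then special flags. A minimal sketch, assuming that header layout and hypothetical field names:)

```python
import csv
import io

# hypothetical subset of URL-component fields for illustration
FIELDS = ["tld", "domain", "subdomain", "path1", "path2"]


def write_opf_csv(records, fileobj):
    """Write records in a NuPIC OPF-style CSV: a row of field names,
    a row of field types (all "string" here), a row of special flags
    (left blank), then the data rows."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    writer.writerow(["string"] * len(FIELDS))
    writer.writerow([""] * len(FIELDS))
    for rec in records:
        writer.writerow([rec.get(f, "") for f in FIELDS])


# example usage with an in-memory buffer instead of a real file
buf = io.StringIO()
write_opf_csv(
    [{"tld": "com", "domain": "google", "subdomain": "maps", "path1": "usa"}],
    buf,
)
```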
>>>>
>>>> --Subutai
>>>>
>>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>>
>>>>> I am tinkering with the CLA a bit and want to play around with web
>>>>> browsing history data.
>>>>>
>>>>> I'm trying to determine whether it would be feasible to predict the
>>>>> URL, or at least the top-level domain, that a web surfer is most
>>>>> likely to visit next, based on their past browsing history. I might go
>>>>> so far as to make a multi-step prediction to short-circuit the
>>>>> surfer's navigation, taking them directly to the page they are
>>>>> interested in.
>>>>>
>>>>> First of all, I'm looking for feedback on whether this idea even makes
>>>>> sense as an application of the CLA, and whether anyone has tried
>>>>> something similar.
>>>>>
>>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>>> URL for input to the SP. One thought is to break the URL into
>>>>> component fields (e.g., top-level domain, URL path and params). The
>>>>> problem is that the encoding should be adaptive and pick up values
>>>>> that have never been seen before. I'm uncertain how to approach this.
>>>>>
>>>>> Since there's no semantic similarity to be inferred between two
>>>>> different TLDs with similar names, a basic numeric encoding doesn't
>>>>> make sense.
>>>>>
>>>>> It might be reasonable to think that different URL paths with the same
>>>>> TLD and subdomain have some semantic similarity (e.g.,
>>>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>>>> also suggest that if two URLs share some path elements, they are even
>>>>> more similar. So ideally, I would come up with an encoding that has
>>>>> little or no overlap for different TLDs, more overlap for the same TLD
>>>>> and subdomain, and even more overlap if they share the same TLD,
>>>>> subdomain and some path elements.
>>>>>
>>>>> Thoughts?
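(The overlap behavior Julie asks for at the end — more shared bits the more URL components two URLs share — can be approximated by giving each component its own slice of the encoding and a deterministic random set of active bits per value. This is a sketch under assumed sizes of 64 bits and 8 active bits per component slice, not an existing NuPIC encoder:)

```python
import random

COMPONENT_BITS = 64  # bits allocated per URL component (assumption)
ACTIVE_BITS = 8      # active bits per component value (assumption)


def _component_sdr(value, offset):
    """Deterministic pseudo-random set of active bit indices for one
    component value, placed inside that component's slice."""
    rng = random.Random(value)  # seeding on the value makes it repeatable
    return {offset + b for b in rng.sample(range(COMPONENT_BITS), ACTIVE_BITS)}


def encode_url(components):
    """Concatenate per-component encodings: URLs that share a component
    share that component's active bits, so total overlap grows with the
    number of shared components (TLD, then subdomain, then path)."""
    active = set()
    for i, value in enumerate(components):
        active |= _component_sdr(value, i * COMPONENT_BITS)
    return active


# e.g. maps.google.com/usa vs maps.google.com/canada share 3 of 4
# components, so they share at least 3 * ACTIVE_BITS bits
a = encode_url(("com", "google", "maps", "usa"))
b = encode_url(("com", "google", "maps", "canada"))
```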
>>>
>>> --
>>> Marek Otahal :o)
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
