I think this is what you're looking for: https://github.com/subutai/nupic.subutai/tree/master/swarm_examples
---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Thu, Apr 10, 2014 at 3:14 PM, Julie Pitt <[email protected]> wrote:

> OK, I think I have my data set up to swarm over. I ended up creating
> different fields for tld, domain, port, subdomain(s) and up to 6 path
> elements (not sure yet if that's too much). At some point I thought someone
> posted either a YouTube video or a link to GitHub with an example swarm
> that uses multiple fields in the SDR and creates a model that predicts
> those fields. If anyone is aware of such an example, I'd appreciate a
> pointer!
>
> On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:
>
>> Thanks all for your very helpful suggestions. I'm working on this when I
>> have scraps of time, so apologies for my belated responses. Subutai/Matt, I
>> was thinking the same thing in terms of parsing a URL into its main
>> components and encoding each one separately. And yes, there is the problem
>> of not knowing how many URL path elements there will be. Perhaps a first
>> cut is to arbitrarily limit it and throw away the rest. I would also
>> throw away URL params initially.
>>
>> Using a random encoder, there would obviously be no real semantic
>> understanding of what these URLs mean, but I'm wondering whether some
>> level of understanding could be achieved by using multiple sensory regions
>> for different parts of the URL and then forming a 2-level hierarchy to
>> identify and predict sequences. If I can get anything interesting out of
>> pure URL data, I would want to add temporal data to see whether any
>> predictions could be made about a user's behavior throughout the day, week
>> and year (i.e., it's holiday season, so you'll probably be cruising
>> Amazon).
>>
>> Back to Chetan's question about where I'm getting my data: I am
>> harvesting my own browsing history from Chrome's SQLite DB. So, this is
>> really a toy. I may conscript a few friends to share their history with
>> me. Any volunteers?
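The field scheme Julie describes above (tld, domain, port, subdomain(s), and up to 6 path elements) can be sketched with Python's standard urllib.parse. This is a minimal sketch, not code from the thread; the helper name, the exact field order, and the 6-element cutoff are assumptions:

```python
from urllib.parse import urlparse

MAX_PATH_ELEMENTS = 6  # arbitrary cutoff, as discussed in the thread

def url_to_fields(url):
    """Split a URL into flat fields suitable for CSV columns."""
    parsed = urlparse(url)
    host_parts = parsed.hostname.split(".") if parsed.hostname else []
    tld = host_parts[-1] if host_parts else ""
    domain = host_parts[-2] if len(host_parts) >= 2 else ""
    subdomain = ".".join(host_parts[:-2])  # everything left of the domain
    # keep at most MAX_PATH_ELEMENTS path elements, discard the rest
    path_elements = [p for p in parsed.path.split("/") if p][:MAX_PATH_ELEMENTS]
    path_elements += [""] * (MAX_PATH_ELEMENTS - len(path_elements))  # pad
    # URL params are thrown away entirely, per Julie's first cut
    return [tld, domain, str(parsed.port or ""), subdomain] + path_elements

print(url_to_fields("http://maps.google.com/usa"))
```

Each returned list would become one CSV row, one column per field.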
>> :-)
>>
>> As far as training goes, I'm thinking there's value in pooling history
>> from many people on which to learn "default" behaviors, and then keeping
>> learning turned on to get a feel for how that particular user behaves. The
>> problem is how to feed the data in, because at that point you get
>> interleaved time steps that are unrelated. There probably needs to be a
>> concept of a user in there.
>>
>> On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Sure, I can look that up. I need to dig around for it.
>>>
>>> --Subutai
>>>
>>> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>>>
>>>> Subutai,
>>>> do you think you could still dig up some papers or data from your
>>>> previous experiments? That would be interesting!
>>>> Cheers,
>>>>
>>>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>>>
>>>>> Hi Julie,
>>>>>
>>>>> Just to add to everyone else's input, this is a great application area
>>>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>>>> good results.
>>>>>
>>>>> In terms of encoders, the simplest approach is to use the OPF with the
>>>>> "string" field type instead of float. Every new string that is
>>>>> encountered will automatically get a new random representation. With
>>>>> this scheme each new string will be treated as a completely unique token
>>>>> with no semantic similarity to other URLs. You'll want to make sure the
>>>>> string doesn't contain extraneous content, since any difference will
>>>>> lead to a new representation.
>>>>>
>>>>> You could break each URL into multiple fields as you suggested. Just
>>>>> make each one a separate CSV field and give each field the string type.
>>>>> I think this will achieve an effect similar to Chetan's suggestion.
>>>>> In my experiment each URL represented a news article and had a natural
>>>>> "topic" associated with it, such as "business" or "politics", so I had
>>>>> a "topic" field.
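Subutai's suggestion of one CSV field per URL component, each typed as a string, might look like the following sketch. It assumes the three-row OPF CSV header convention (field names, then field types, then special flags); the field names, file name, and example rows are made up for illustration:

```python
import csv

# Hypothetical rows from a URL parser: (tld, domain, subdomain)
rows = [
    ("com", "google", "maps"),
    ("com", "google", "mail"),
    ("org", "numenta", ""),
]

with open("browsing.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tld", "domain", "subdomain"])  # row 1: field names
    writer.writerow(["string", "string", "string"])  # row 2: OPF field types
    writer.writerow(["", "", ""])                    # row 3: special flags (none)
    writer.writerows(rows)
```

With every field typed "string", each distinct value gets its own random SDR, as Subutai describes.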
>>>>>
>>>>> For best results I would recommend starting with a smaller dataset
>>>>> with a relatively small number of unique strings and then working your
>>>>> way up from there. The amount of data you need to get good results will
>>>>> grow quickly as the number of unique strings increases. You'll probably
>>>>> want to swarm on the dataset, as the parameters may need to be quite
>>>>> different from the default hotgym parameters.
>>>>>
>>>>> I'm curious to see how this goes. Please send along your results and
>>>>> questions as you make progress!
>>>>>
>>>>> --Subutai
>>>>>
>>>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>>>
>>>>>> I am tinkering with the CLA a bit and want to play around with web
>>>>>> browsing history data.
>>>>>>
>>>>>> I'm trying to determine whether it would be feasible to predict the
>>>>>> URL, or at least the top-level domain, that is most likely to be
>>>>>> visited next by a web surfer, based on their past browsing history. I
>>>>>> might go so far as to make a multi-step prediction to short-circuit the
>>>>>> web surfer's navigation directly to the page they are interested in.
>>>>>>
>>>>>> First of all, I'm looking for feedback on whether this idea even
>>>>>> makes sense as an application of the CLA, and whether anyone has tried
>>>>>> something similar.
>>>>>>
>>>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>>>> URL for input to the SP. One thought is to break the URL into component
>>>>>> fields (e.g., top-level domain, URL path and params). The problem is
>>>>>> that the encoding should be adaptive and pick up values that have never
>>>>>> been seen before. I'm uncertain how to approach this.
>>>>>>
>>>>>> Since there's no semantic similarity to be inferred between two
>>>>>> different TLDs with similar names, a basic numeric encoding doesn't
>>>>>> make sense.
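Subutai's advice above to swarm on the dataset (rather than reusing the hotgym parameters) would mean writing a swarm description over the string fields, in the style of the multi-field swarm examples Matt links at the top of the thread. A sketch only; the field names, file name, and choice of predicted field are assumptions:

```python
# Sketch of a swarm description over multiple string fields (names assumed).
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "tld", "fieldType": "string"},
        {"fieldName": "domain", "fieldType": "string"},
        {"fieldName": "subdomain", "fieldType": "string"},
    ],
    "streamDef": {
        "info": "browsing history",
        "version": 1,
        "streams": [
            {
                "info": "browsing.csv",
                "source": "file://browsing.csv",
                "columns": ["*"],
            }
        ],
    },
    "inferenceType": "TemporalMultiStep",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": "domain",
    },
    "swarmSize": "medium",
}
```

If memory serves, the linked examples run such a description through nupic.swarming's permutations_runner; check the repo for the exact invocation.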
>>>>>>
>>>>>> It might be reasonable to think that different URL paths with the
>>>>>> same TLD and subdomain have some semantic similarity (e.g.,
>>>>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>>>>> also suggest that if two URLs share some path elements, they are even
>>>>>> more similar. So ideally, I would come up with an encoding that has
>>>>>> little or no overlap for different TLDs, more overlap for the same TLD
>>>>>> and subdomain, and even more if they have the same TLD and subdomain
>>>>>> and share path elements.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> _______________________________________________
>>>>>> nupic mailing list
>>>>>> [email protected]
>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>> --
>>>> Marek Otahal :o)
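The overlap properties Julie asks for at the bottom of the thread, with near-zero overlap across different TLDs, more for a shared TLD and subdomain, and still more for shared path elements, could be approximated by concatenating a deterministic pseudo-random bit set per URL component. A sketch under assumed segment sizes and hashing scheme, not an established NuPIC encoder:

```python
import hashlib
import random

SEGMENT_BITS = 128  # bit range reserved per URL component (assumed size)
ACTIVE_BITS = 10    # active bits per component (assumed)

def token_sdr(token, segment):
    """Deterministic pseudo-random bit set for a token within its segment."""
    seed = int(hashlib.md5(token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    offset = segment * SEGMENT_BITS
    return {offset + b for b in rng.sample(range(SEGMENT_BITS), ACTIVE_BITS)}

def encode_url(tld, subdomain, path_elements):
    """Union of per-component bit sets; shared components yield shared bits."""
    bits = token_sdr(tld, 0) | token_sdr(subdomain, 1)
    for i, elem in enumerate(path_elements):
        bits |= token_sdr(elem, 2 + i)
    return bits

a = encode_url("com", "maps.google", ["usa"])
b = encode_url("com", "maps.google", ["canada"])
c = encode_url("org", "en.wikipedia", ["wiki"])
# a and b are guaranteed to share the TLD and subdomain bits;
# a and c share bits only by chance collisions within a segment
print(len(a & b), len(a & c))
```

The same token always hashes to the same bits, so values never seen before get a representation automatically, which addresses the adaptivity concern in the original question.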
