That's it! Thank you.
On Thu, Apr 10, 2014 at 3:28 PM, Matthew Taylor <[email protected]> wrote:

> I think this is what you're looking for:
> https://github.com/subutai/nupic.subutai/tree/master/swarm_examples
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta

On Thu, Apr 10, 2014 at 3:14 PM, Julie Pitt <[email protected]> wrote:

> OK, I think I have my data set up to swarm over. I ended up creating separate fields for TLD, domain, port, subdomain(s), and up to six path elements (not sure yet whether that's too many). At some point I thought someone posted either a YouTube video or a link to GitHub with an example swarm that uses multiple fields in the SDR and creates a model that predicts those fields. If anyone is aware of such an example, I'd appreciate a pointer!

On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:

> Thanks, all, for your very helpful suggestions. I'm working on this when I have scraps of time, so apologies for my late responses. Subutai/Matt, I was thinking the same thing in terms of parsing a URL into its main components and encoding each one separately. And yes, there is the problem of not knowing how many URL path elements there will be. Perhaps a first cut is to arbitrarily limit the count and throw away the rest. I would also throw away URL params initially.
>
> Using a random encoder, there would obviously be no real semantic understanding of what these URLs mean, but I'm wondering if some level of understanding could be achieved by using multiple sensory regions for different parts of the URL and then forming a two-level hierarchy to identify and predict sequences. If I can get anything interesting out of pure URL data, I would want to add temporal data to see whether any predictions could be made about a user's behavior throughout the day, week, and year (i.e., it's holiday season, so you'll probably be cruising Amazon).
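The per-field URL split described above (TLD, domain, port, subdomains, and a capped number of path elements) can be sketched in Python. A minimal sketch, assuming the six-path-element cap mentioned in the thread; the naive rightmost-label TLD rule is an illustrative simplification (real TLD extraction needs a public-suffix list, e.g. for "co.uk"):

```python
from urllib.parse import urlparse

# Fixed path-element limit, as proposed in the thread; extra elements
# (and all query params) are simply discarded.
MAX_PATH_ELEMENTS = 6

def url_to_fields(url):
    """Split a URL into TLD, domain, subdomain(s), port, and path fields.

    The rightmost host label is treated as the TLD (a simplification).
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".")
    tld = labels[-1] if len(labels) > 1 else ""
    domain = labels[-2] if len(labels) > 1 else host
    subdomain = ".".join(labels[:-2])  # everything left of the domain
    port = str(parsed.port or "")
    # Keep at most MAX_PATH_ELEMENTS path elements; pad to fixed width
    # so every record has the same columns.
    path = [p for p in parsed.path.split("/") if p][:MAX_PATH_ELEMENTS]
    path += [""] * (MAX_PATH_ELEMENTS - len(path))
    fields = {"tld": tld, "domain": domain, "subdomain": subdomain,
              "port": port}
    fields.update({"path%d" % i: p for i, p in enumerate(path)})
    return fields
```

With this, maps.google.com/usa and maps.google.com/canada differ only in their first path field, which is the property the multi-field setup relies on.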
> Back to Chetan's question about where I'm getting my data: I am harvesting my own browsing history from Chrome's SQLite DB, so this is really a toy. I may conscript a few friends to share their history with me. Any volunteers? :-)
>
> As far as training goes, I'm thinking there's value in pooling history from many people to learn "default" behaviors, and then keeping learning turned on to get a feel for how a particular user behaves. The problem is how to feed the data in, because at that point you get interleaved time steps that are unrelated. There probably needs to be a concept of a user in there.

On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:

> Sure, I can look that up. I need to dig around for it.
>
> --Subutai

On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:

> Subutai, do you think you could still dig up some paper or data from your previous experiments? It would be interesting!
> Cheers,
>
> --
> Marek Otahal :o)

On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:

> Hi Julie,
>
> Just to add to everyone else's input: this is a great application area for CLAs. I did some similar work a couple of years ago and got pretty good results.
>
> In terms of encoders, the simplest approach is to use the OPF with the "string" field type instead of float. Every new string that is encountered will automatically get a new random representation. With this scheme, each new string is treated as a completely unique token with no semantic similarity to other URLs. You'll want to make sure the string doesn't contain extraneous content, since any difference will lead to a new representation.
>
> You could break each URL into multiple fields, as you suggested.
> Just make each one a separate CSV field, and make each field a string type. I think this will achieve an effect similar to Chetan's suggestion. In my experiment each URL represented a news article and had a natural "topic" associated with it, such as "business" or "politics", so I had a "topic" field.
>
> For best results I would recommend starting with a smaller dataset containing a relatively small number of unique strings and then working your way up from there. The amount of data you need to get good results grows fast as the number of unique strings increases. You'll probably want to swarm on the dataset, since the parameters may need to be quite different from the default hotgym parameters.
>
> I'm curious to see how this goes. Please send along your results and questions as you make progress!
>
> --Subutai

On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:

> I am tinkering with the CLA a bit and want to play around with web browsing history data.
>
> I'm trying to determine whether it would be feasible to predict the URL, or at least the top-level domain, that a web surfer is most likely to visit next, based on their past browsing history. I might go so far as to make a multi-step prediction that short-circuits navigation and takes the surfer directly to the page they are interested in.
>
> First of all, I'm looking for feedback on whether this idea even makes sense as an application of the CLA, and whether anyone has tried something similar.
>
> Second, I'm a little stuck coming up with a good way to encode a URL for input to the SP. One thought is to break the URL into component fields (e.g., top-level domain, URL path, and params).
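The one-string-field-per-URL-component CSV suggested above might be written like this. A minimal sketch: the three header rows (names, types, special flags) follow the OPF input-file convention, but the exact layout should be verified against your NuPIC version's documentation; the field names are illustrative:

```python
import csv

# Illustrative column set: one string-typed field per URL component.
FIELDS = ["tld", "domain", "subdomain", "path0", "path1"]

def write_opf_csv(rows, path):
    """Write visit records as an OPF-style CSV: one string field per
    URL component, preceded by name/type/flag header rows."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(FIELDS)                    # row 1: field names
        w.writerow(["string"] * len(FIELDS))  # row 2: field types
        w.writerow([""] * len(FIELDS))        # row 3: special flags (none)
        for row in rows:
            w.writerow([row.get(name, "") for name in FIELDS])

# Hypothetical usage with one visit record:
write_opf_csv(
    [{"tld": "com", "domain": "google", "subdomain": "maps",
      "path0": "usa"}],
    "history.csv")
```

Each distinct string in a column then gets its own random representation, which is the per-field analogue of the single "string" field Subutai describes.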
> The problem is that the encoding should be adaptive and pick up values that have never been seen before. I'm uncertain how to approach this.
>
> Since there's no semantic similarity to be inferred between two different TLDs with similar names, a basic numeric encoding doesn't make sense.
>
> It might be reasonable to think that different URL paths with the same TLD and subdomain have some semantic similarity (e.g., maps.google.com/usa and maps.google.com/canada are both maps). I would also suggest that if two URLs share some path elements, they are even more similar. So ideally, I would come up with an encoding that has little or no overlap for different TLDs, more overlap for the same TLD and subdomain, and even more if two URLs also share path elements.
>
> Thoughts?

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
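The overlap scheme described above (little or no overlap across TLDs, more for a shared TLD and subdomain, still more for shared path elements) could be approximated with a deterministic hash-based encoding, where each URL component contributes its own fixed set of active bits, so two URLs overlap in proportion to the components they share. This is an illustrative sketch, not a NuPIC encoder; all sizes are arbitrary assumptions:

```python
import hashlib

N_BITS = 2048           # total SDR width (arbitrary choice)
BITS_PER_COMPONENT = 8  # active bits contributed by each component

def component_bits(name, value):
    """Deterministic pseudo-random bit positions for one component.

    Hashing (name, value) means the same component always activates
    the same bits, even for values never seen before -- the adaptive
    property asked about above.
    """
    bits = set()
    i = 0
    while len(bits) < BITS_PER_COMPONENT:
        h = hashlib.sha256(("%s=%s#%d" % (name, value, i)).encode())
        bits.add(int.from_bytes(h.digest()[:4], "big") % N_BITS)
        i += 1
    return bits

def encode_url_fields(fields):
    """Union the bits of all non-empty components of one URL."""
    active = set()
    for name, value in fields.items():
        if value:
            active |= component_bits(name, value)
    return active

a = encode_url_fields({"tld": "com", "subdomain": "maps", "path0": "usa"})
b = encode_url_fields({"tld": "com", "subdomain": "maps", "path0": "canada"})
c = encode_url_fields({"tld": "org", "subdomain": "wiki", "path0": "cats"})
# a and b share the TLD and subdomain bits; a and c almost surely share none.
```

The design choice here is that similarity is built from shared components rather than learned, which matches the thread's intuition that two maps.google.com paths should overlap while unrelated TLDs should not.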
