I think this is what you're looking for:
https://github.com/subutai/nupic.subutai/tree/master/swarm_examples

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Thu, Apr 10, 2014 at 3:14 PM, Julie Pitt <[email protected]> wrote:

> OK, I think I have my data set up to swarm over. I ended up creating
> different fields for tld, domain, port, subdomain(s) and up to 6 path
> elements (not sure yet if that's too much). At some point I thought someone
> posted either a YouTube video or a link to GitHub with an example swarm
> that uses multiple fields in the SDR and creates a model that predicts
> those fields. If anyone is aware of such an example, I'd appreciate a
> pointer!
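[Editor's note: a minimal sketch of the parsing step described above, using only the Python standard library. The field names (tld, domain, subdomain, port, path_0..path_5) are illustrative, chosen to mirror the fields Julie describes; they are not from any NuPIC example.]

```python
from urllib.parse import urlparse

MAX_PATH_ELEMENTS = 6  # arbitrary cap, as discussed; extra path elements are discarded

def url_to_fields(url):
    """Split a URL into flat string fields: tld, domain, subdomain, port,
    and up to MAX_PATH_ELEMENTS path elements (path_0 .. path_5).
    Missing values become empty strings; query params are thrown away."""
    parsed = urlparse(url)
    host_parts = parsed.hostname.split(".") if parsed.hostname else []
    fields = {
        "tld": host_parts[-1] if host_parts else "",
        "domain": host_parts[-2] if len(host_parts) > 1 else "",
        "subdomain": ".".join(host_parts[:-2]),
        "port": str(parsed.port) if parsed.port else "",
    }
    path_elements = [p for p in parsed.path.split("/") if p]
    for i in range(MAX_PATH_ELEMENTS):
        fields["path_%d" % i] = path_elements[i] if i < len(path_elements) else ""
    return fields
```

Each resulting field could then become one column in the swarm input file.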
>
>
> On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:
>
>> Thanks all for your very helpful suggestions. I'm working on this when I
>> have scraps of time, so apologies for my delayed responses. Subutai/Matt, I
>> was thinking the same thing in terms of parsing a URL into its main
>> components and encoding each one separately. And yes, there is the problem
>> of not knowing how many URL path elements there will be. Perhaps a first
>> cut is to just arbitrarily limit it and throw away the rest. I would also
>> throw away URL params initially.
>>
>> Using a random encoder, there would obviously be no real semantic
>> understanding of what these URLs mean, but I'm wondering if some level of
>> understanding could be achieved by using multiple sensory regions for
>> different parts of the URL and then forming a 2-level hierarchy to identify
>> and predict sequences. If I can get anything interesting out of pure URL
>> data, I would want to add temporal data to see if any predictions could be
>> made in terms of a user's behavior throughout the day, week and year.
>> (i.e., it's holiday season, so you'll probably be cruising Amazon).
>>
>> Back to Chetan's question about where I'm getting my data, I am
>> harvesting my own browsing history from Chrome's sqlite DB. So, this is
>> really a toy. I may conscript a few friends to share with me their history.
>> Any volunteers? :-)
>>
>> As far as training goes, I'm thinking there's value in pooling history
>> from many people on which to learn "default" behaviors, and then keeping
>> learning turned on to get a feel for how each particular user behaves. The
>> problem is how to feed the data in, because at that point you get
>> interleaved time steps that are unrelated. There probably needs to be a
>> concept of a user in there.
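[Editor's note: one way to sketch that "concept of a user": group records per user and mark the start of each user's sequence with a reset flag. In the OPF this would map to a field with the reset special flag, but check the docs for your version; the helper below is purely illustrative.]

```python
from itertools import groupby

# Each raw record is (user_id, url); records may arrive interleaved
# across users, e.g. straight out of several people's browser history.
records = [
    ("alice", "news.ycombinator.com"),
    ("bob", "maps.google.com/usa"),
    ("alice", "github.com/numenta"),
    ("bob", "maps.google.com/canada"),
]

def by_user_with_resets(records):
    """Sort records by user (a stable sort, so each user's own temporal
    order is preserved) and emit (reset, user_id, url) rows, where
    reset == 1 marks the start of that user's sequence."""
    out = []
    ordered = sorted(records, key=lambda r: r[0])
    for user, group in groupby(ordered, key=lambda r: r[0]):
        for i, (_, url) in enumerate(group):
            out.append((1 if i == 0 else 0, user, url))
    return out
```

Feeding the model one contiguous per-user block at a time, with a reset at each boundary, avoids learning spurious transitions between unrelated users.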
>>
>>
>> On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>>
>>> Sure, I can look that up.  I need to dig around for it.
>>>
>>> --Subutai
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>>>
>>>> Subutai,
>>>> do you think you could still dig up some paper or data from your previous
>>>> experiments? It would be interesting!
>>>> Cheers,
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>>>
>>>>> Hi Julie,
>>>>>
>>>>> Just to add to everyone else's input, this is a great application area
>>>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>>>> good results.
>>>>>
>>>>> In terms of encoders, the simplest approach is to use the OPF with the
>>>>> "string" field type instead of float. Every new string that is encountered
>>>>> will automatically get a new random representation. With this scheme, each
>>>>> new string will be treated as a completely unique token with no semantic
>>>>> similarity to other URLs. You'll want to make sure the string doesn't
>>>>> contain extraneous stuff, since any difference will lead to a new
>>>>> representation.
>>>>>
>>>>> You could break each URL into multiple fields as you suggested. Just
>>>>> make each one a separate CSV field and each field into a string type.  I
>>>>> think this will achieve an effect that is similar to Chetan's suggestion.
>>>>> In my experiment each URL represented a news article and had a natural
>>>>> "topic" associated with it such as "business" or "politics" so I had a
>>>>> "topic" field.
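[Editor's note: a sketch of what such an input file might look like, assuming the OPF's three-row CSV header (field names, field types, special flags) and illustrative field names; check the NuPIC docs for the exact format your version expects.]

```python
import csv

# One record per page visit; every field is declared "string" so the OPF
# assigns each new value its own random representation.
visits = [
    ("com", "google", "maps", "usa"),
    ("com", "google", "maps", "canada"),
    ("com", "amazon", "www", "gp"),
]

with open("visits.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tld", "domain", "subdomain", "path_0"])  # field names
    writer.writerow(["string", "string", "string", "string"])  # field types
    writer.writerow(["", "", "", ""])                          # special flags (none here)
    writer.writerows(visits)
```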
>>>>>
>>>>> For best results I would recommend starting with a smaller dataset with
>>>>> a relatively small number of unique strings and then work your way up
>>>>> from there. The amount of data you need to get good results will grow
>>>>> fast as the number of unique strings increases. You'll probably want to
>>>>> swarm on the dataset, as the parameters may need to be quite different
>>>>> from the default hotgym parameters.
>>>>>
>>>>> I'm curious to see how this goes. Please send along your results and
>>>>> questions as you make progress!
>>>>>
>>>>> --Subutai
>>>>>
>>>>>
>>>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>>>
>>>>>> I am tinkering with the CLA a bit and want to play around with web
>>>>>> browsing history data.
>>>>>>
>>>>>> I'm trying to determine whether it would be feasible to predict the
>>>>>> URL, or at least the top-level domain, that a web surfer is most likely
>>>>>> to visit next, based on their past browsing history. I might go so far
>>>>>> as to make a multi-step prediction that short-circuits the surfer's
>>>>>> navigation, taking them directly to the page they are interested in.
>>>>>>
>>>>>> First of all, I'm looking for feedback on whether this idea even
>>>>>> makes sense as an application of the CLA, and whether anyone has tried
>>>>>> something similar.
>>>>>>
>>>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>>>> URL for input to the SP. One thought is to break the URL into component
>>>>>> fields (e.g., top-level domain, URL path and params). The problem is that
>>>>>> the encoding should be adaptive and pick up values that have never been
>>>>>> seen before. I'm uncertain how to approach this.
>>>>>>
>>>>>> Since there's no semantic similarity to be inferred between two
>>>>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>>>>> sense.
>>>>>>
>>>>>> It might be reasonable to think that different URL paths with the same
>>>>>> TLD and subdomain have some semantic similarity (e.g.,
>>>>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>>>>> also suggest that if two URLs share some path elements, they are even
>>>>>> more similar. So ideally, I would come up with an encoding that has
>>>>>> little or no overlap for different TLDs, more overlap for URLs with the
>>>>>> same TLD and subdomain, and even more overlap when they also share path
>>>>>> elements.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> _______________________________________________
>>>>>> nupic mailing list
>>>>>> [email protected]
>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Marek Otahal :o)
>>>>
>>>
>>
>