OK, I think I have my data set up to swarm over. I ended up creating
different fields for tld, domain, port, subdomain(s) and up to 6 path
elements (not sure yet if that's too much). At some point I thought someone
posted either a YouTube video or a GitHub link with an example swarm
that uses multiple fields in the SDR and creates a model that predicts
those fields. If anyone is aware of such an example, I'd appreciate a
pointer!
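In case it's useful context, this is roughly how I'm splitting URLs into those fields. A sketch using only Python's stdlib; note the TLD extraction here is naive (it just takes the last dot-separated label, so multi-part suffixes like .co.uk would need a real public-suffix list):

```python
from urllib.parse import urlsplit

def url_to_fields(url, max_path_elements=6):
    """Split a URL into tld, domain, subdomain, port, and path fields."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    tld = labels[-1] if len(labels) > 1 else ""
    domain = labels[-2] if len(labels) > 1 else host
    subdomain = ".".join(labels[:-2])  # everything left of the domain
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Keep at most max_path_elements path segments, padded with empty
    # strings so every record has the same number of fields.
    path = [p for p in parts.path.split("/") if p][:max_path_elements]
    path += [""] * (max_path_elements - len(path))
    return {"tld": tld, "domain": domain, "subdomain": subdomain,
            "port": port, "path": path}
```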


On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:

> Thanks all for your very helpful suggestions. I'm working on this when I
> have scraps of time, so apologies for my late responses. Subutai/Matt, I
> was thinking the same thing in terms of parsing a URL into its main
> components and encoding each one separately. And yes, there is the problem
> of not knowing how many URL path elements there will be. Perhaps a first
> cut is to just arbitrarily limit it and throw away the rest. I would also
> throw away URL params initially.
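Throwing away the params (and fragment) is a one-liner with the stdlib; a sketch of that normalization step, assuming nothing meaningful ever lives in the query string:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_params(url):
    """Normalize a URL by dropping the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```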
>
> Using a random encoder, there would obviously be no real semantic
> understanding of what these URLs mean, but I'm wondering if some level of
> understanding could be achieved by using multiple sensory regions for
> different parts of the URL and then forming a 2-level hierarchy to identify
> and predict sequences. If I can get anything interesting out of pure URL
> data, I would want to add temporal data to see if any predictions could be
> made in terms of a user's behavior throughout the day, week and year.
> (e.g., it's holiday season, so you'll probably be cruising Amazon).
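For the temporal part, a simple starting point would be deriving a few coarse calendar fields from each visit's timestamp and feeding them alongside the URL fields (a sketch; NuPIC also has a date encoder, but plain derived fields are easy to experiment with first):

```python
from datetime import datetime, timezone

def time_fields(ts):
    """Derive coarse temporal fields from a Unix timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "hour_of_day": dt.hour,
        "day_of_week": dt.weekday(),           # 0 = Monday
        "day_of_year": dt.timetuple().tm_yday, # captures holiday season
    }
```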
>
> Back to Chetan's question about where I'm getting my data, I am harvesting
> my own browsing history from Chrome's sqlite DB. So, this is really a toy.
> I may conscript a few friends to share with me their history. Any
> volunteers? :-)
>
> As far as training goes, I'm thinking there's value in pooling history
> from many people to learn "default" behaviors, and then keeping learning
> turned on to get a feel for how that particular user behaves. The problem
> is how to feed the data in, because at that point you get interleaved time
> steps that are unrelated. There probably needs to be a concept of a user
> in there.
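One way to handle the interleaving (assuming each record carries some user id, which is a field I'd have to add) is to group the pooled history by user and feed each user's visits as one contiguous, time-ordered sequence, with a sequence reset between users so the model doesn't learn spurious cross-user transitions. A rough sketch of the grouping:

```python
from itertools import groupby

def per_user_sequences(records):
    """Group pooled browsing records into one time-ordered sequence per
    user. Each record is a dict with at least 'user' and 'time' keys."""
    ordered = sorted(records, key=lambda r: (r["user"], r["time"]))
    for user, visits in groupby(ordered, key=lambda r: r["user"]):
        yield user, list(visits)
```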
>
>
> On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:
>
>>
>> Sure, I can look that up.  I need to dig around for it.
>>
>> --Subutai
>>
>>
>> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>>
>>> Subutai,
>>> do you think you could still dig up some paper or data from your previous
>>> experiments? It would be interesting!
>>> Cheers,
>>>
>>>
>>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>>
>>>> Hi Julie,
>>>>
>>>> Just to add to everyone else's input, this is a great application area
>>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>>> good results.
>>>>
>>>> In terms of encoders, the simplest is to just use the OPF and use the
>>>> "string" field type instead of float. Every new string that is encountered
>>>> will automatically get a new random representation.  With this scheme each
>>>> new string will be treated as a completely unique token with no semantic
>>>> similarity to other URLs.  You'll want to make sure the string doesn't
>>>> contain extraneous stuff, since any difference will lead to a new
>>>> representation.
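The idea behind those random representations can be sketched outside the OPF too: seed an RNG with the string itself, so the same token always maps to the same w active bits out of n, and two different tokens overlap only by chance. (The sizes here are made up; NuPIC's own encoders differ in detail.)

```python
import random

def random_sdr(token, n=2048, w=40):
    """Deterministic random SDR for a string: same token -> same bits,
    different tokens -> essentially no overlap by chance."""
    rng = random.Random(token)  # seeding with the string makes it stable
    return set(rng.sample(range(n), w))
```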
>>>>
>>>> You could break each URL into multiple fields as you suggested. Just
>>>> make each one a separate CSV field and each field into a string type.  I
>>>> think this will achieve an effect that is similar to Chetan's suggestion.
>>>> In my experiment each URL represented a news article and had a natural
>>>> "topic" associated with it such as "business" or "politics" so I had a
>>>> "topic" field.
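If I remember the OPF's CSV layout right, that's three header rows (field names, field types, then special flags) followed by the data. A sketch of writing the multi-field file with every URL part typed as a string; the exact header conventions are from memory, so worth double-checking against the OPF docs:

```python
import csv

URL_FIELDS = ["tld", "domain", "subdomain", "path0", "path1"]

def write_opf_csv(rows, fileobj):
    """Write URL-component rows in the OPF CSV layout: three header rows
    (names, types, flags), then one data row per visit."""
    w = csv.writer(fileobj)
    w.writerow(URL_FIELDS)
    w.writerow(["string"] * len(URL_FIELDS))  # every field a string type
    w.writerow([""] * len(URL_FIELDS))        # no special flags
    for r in rows:
        w.writerow([r.get(f, "") for f in URL_FIELDS])
```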
>>>>
>>>> For best results I would recommend starting with a smaller dataset with
>>>> a relatively small number of unique strings and then work your way up from
>>>> there.  The amount of data you need to get good results will grow fast as
>>>> the number of unique strings increases.  You'll probably want to swarm on
>>>> the dataset as the parameters may need to be quite different from the
>>>> default hotgym parameters.
>>>>
>>>> I'm curious to see how this goes. Please send along your results and
>>>> questions as you make progress!
>>>>
>>>> --Subutai
>>>>
>>>>
>>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>>
>>>>>  I am tinkering with the CLA a bit and want to play around with web
>>>>> browsing history data.
>>>>>
>>>>> I'm trying to determine whether it would be feasible to predict the
>>>>> URL, or at least the top-level domain that is most likely to be visited
>>>>> next by a web surfer, based on their past browsing history. I might go so
>>>>> far as to make a multi-step prediction to short-circuit a web surfer's
>>>>> navigation directly to the page they are interested in.
>>>>>
>>>>> First of all, I'm looking for feedback on whether this idea even makes
>>>>> sense as an application of the CLA, and whether anyone has tried something
>>>>> similar.
>>>>>
>>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>>> URL for input to the SP. One thought is to break the URL into component
>>>>> fields (e.g., top-level domain, URL path and params). The problem is that
>>>>> the encoding should be adaptive and pick up values that have never been
>>>>> seen before. I'm uncertain how to approach this.
>>>>>
>>>>> Since there's no semantic similarity to be inferred between two
>>>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>>>> sense.
>>>>>
>>>>> It might be reasonable to think that different URL paths with the same
>>>>> TLD and subdomain have some semantic similarity (e.g.,
>>>>> maps.google.com/usa and maps.google.com/canada are both maps). I
>>>>> would also suggest that if two URLs share some path elements, they are even
>>>>> more similar. So ideally, I would come up with an encoding that has little
>>>>> or no overlap for different TLDs, more overlap with same TLDs and
>>>>> subdomain, and even more if they have the same TLD, subdomain and share
>>>>> path elements.
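That graded overlap falls out almost for free if each (field, value) pair gets its own deterministic random bit set and the URL's encoding is the union: two URLs then share exactly the bits of the components they have in common, so same TLD gives some overlap, same TLD + subdomain gives more, and shared path elements more still. A sketch with made-up sizes:

```python
import random

def component_sdr(name, value, n=2048, w=16):
    """Stable random bit set for one (field, value) pair."""
    rng = random.Random("%s=%s" % (name, value))
    return set(rng.sample(range(n), w))

def encode_url_fields(fields, n=2048, w=16):
    """Union of per-component SDRs: shared components -> shared bits."""
    bits = set()
    for name, value in fields.items():
        bits |= component_sdr(name, value, n, w)
    return bits
```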
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> _______________________________________________
>>>>> nupic mailing list
>>>>> [email protected]
>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Marek Otahal :o)
>>>
>>>
>>>
>>
>>
>>
>
