Thanks all for your very helpful suggestions. I'm working on this when I
have scraps of time, so apologies for my delayed responses. Subutai/Matt, I
was thinking the same thing in terms of parsing a URL into its main
components and encoding each one separately. And yes, there is the problem
of not knowing how many URL path elements there will be. Perhaps a first
cut is to just arbitrarily limit it and throw away the rest. I would also
throw away URL params initially.
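Here's roughly what I have in mind for the parsing step, sketched in Python
(the cap of three path elements and the field layout are arbitrary
placeholders, not anything final):

```python
from urllib.parse import urlsplit

MAX_PATH_ELEMENTS = 3  # arbitrary cap; anything past it is thrown away

def url_to_fields(url, max_elements=MAX_PATH_ELEMENTS):
    """Split a URL into (domain, path_1..path_N) fields, discarding params."""
    parts = urlsplit(url)
    path = [p for p in parts.path.split("/") if p][:max_elements]
    # Pad so every row has the same number of CSV columns.
    path += [""] * (max_elements - len(path))
    return [parts.netloc] + path

print(url_to_fields("http://maps.google.com/usa/ca?zoom=12"))
# -> ['maps.google.com', 'usa', 'ca', '']
```

Each element of the returned list would become its own CSV column, which
lines up with the one-string-field-per-component idea.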

Using a random encoder, there would obviously be no real semantic
understanding of what these URLs mean, but I'm wondering if some level of
understanding could be achieved by using multiple sensory regions for
different parts of the URL and then forming a 2-level hierarchy to identify
and predict sequences. If I can get anything interesting out of pure URL
data, I would want to add temporal data to see if any predictions could be
made in terms of a user's behavior throughout the day, week and year.
(e.g., it's the holiday season, so you'll probably be cruising Amazon).

Back to Chetan's question about where I'm getting my data, I am harvesting
my own browsing history from Chrome's sqlite DB. So, this is really a toy.
I may conscript a few friends to share with me their history. Any
volunteers? :-)

As far as training goes, I'm thinking there's value in pooling history from
many people on which to learn "default" behaviors, and then keeping learning
turned on to get a feel for how a particular user behaves. The problem is how
to feed the data in, because at that point you get interleaved time steps
that are unrelated. There probably needs to be a concept of a user in there.
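One simple way to handle that might be to make the user an explicit key and
feed each user's visits as its own time-ordered sequence, with a model reset
in between. A minimal sketch, assuming records shaped as (user, timestamp,
url):

```python
from collections import defaultdict

def sequences_by_user(records):
    """Group (user, timestamp, url) records into one time-ordered
    sequence per user, so each can be fed to the model as its own
    stream (with a reset in between) instead of interleaved."""
    sequences = defaultdict(list)
    for user, timestamp, url in records:
        sequences[user].append((timestamp, url))
    for seq in sequences.values():
        seq.sort()  # restore each user's own visit order
    return dict(sequences)

visits = [("alice", 2, "a.com"), ("bob", 1, "b.com"), ("alice", 1, "c.com")]
print(sequences_by_user(visits))
# -> {'alice': [(1, 'c.com'), (2, 'a.com')], 'bob': [(1, 'b.com')]}
```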


On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:

>
> Sure, I can look that up.  I need to dig around for it.
>
> --Subutai
>
>
> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>
>> Subutai,
>> do you think you could still dig up a paper or some data from your previous
>> experiments? That would be interesting!
>> Cheers,
>>
>>
>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Hi Julie,
>>>
>>> Just to add to everyone else's input, this is a great application area
>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>> good results.
>>>
>>> In terms of encoders, the simplest is to just use the OPF and use the
>>> "string" field type instead of float. Every new string that is encountered
>>> will automatically get a new random representation.  With this scheme each
>>> new string will be treated as a completely unique token with no semantic
>>> similarity to other URLs.  You'll want to make sure the string doesn't
>>> contain extraneous stuff since any difference will lead to a new
>>> representation.
>>>
>>> You could break each URL into multiple fields as you suggested. Just
>>> make each one a separate CSV field and give each field the string type.  I
>>> think this will achieve an effect that is similar to Chetan's suggestion.
>>> In my experiment each URL represented a news article and had a natural
>>> "topic" associated with it such as "business" or "politics" so I had a
>>> "topic" field.
>>>
>>> For best results I would recommend starting with a smaller dataset with
>>> a relatively small number of unique strings and then work your way up from
>>> there.  The amount of data you need to get good results will grow fast as
>>> the number of unique strings increases.  You'll probably want to swarm on
>>> the dataset as the parameters may need to be quite different from the
>>> default hotgym parameters.
>>>
>>> I'm curious to see how this goes. Please send along your results and
>>> questions as you make progress!
>>>
>>> --Subutai
>>>
>>>
>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>
>>>>  I am tinkering with the CLA a bit and want to play around with web
>>>> browsing history data.
>>>>
>>>> I'm trying to determine whether it would be feasible to predict the
>>>> URL, or at least the top-level domain that is most likely to be visited
>>>> next by a web surfer, based on their past browsing history. I might go so
>>>> far as to make a multi-step prediction to short-circuit a web surfer's
>>>> navigation, taking them directly to the page they are interested in.
>>>>
>>>> First of all, I'm looking for feedback on whether this idea even makes
>>>> sense as an application of the CLA, and whether anyone has tried something
>>>> similar.
>>>>
>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>> URL for input to the SP. One thought is to break the URL into component
>>>> fields (e.g., top-level domain, URL path and params). The problem is that
>>>> the encoding should be adaptive and pick up values that have never been
>>>> seen before. I'm uncertain how to approach this.
>>>>
>>>> Since there's no semantic similarity to be inferred between two
>>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>>> sense.
>>>>
>>>> It might be reasonable to think that different URL paths with the same
>>>> TLD and subdomain have some semantic similarity (e.g.,
>>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>>> also suggest that if two URLs share some path elements, they are even more
>>>> similar. So ideally, I would come up with an encoding that has little or no
>>>> overlap for different TLDs, more overlap with same TLDs and subdomain, and
>>>> even more if they have the same TLD, subdomain and share path elements.
>>>>
>>>> Thoughts?
>>>>
>>>> _______________________________________________
>>>> nupic mailing list
>>>> [email protected]
>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Marek Otahal :o)
>>
>>
>
>
