Julie, I think this web browsing prediction is a great project idea. It's inherently spatial and temporal in its patterns. I'd love to see what you can come up with, and what kinds of predictions can be made by the CLA over large datasets.
Out of curiosity, what datasets are you planning to train on?

Regarding a URL encoder, my gut instinct is that an encoder that operates similarly to the Random Distributed Scalar Encoder might work well. (For details on the RDSE, see this presentation: [1].) We'll have to adapt it to work with URLs as you described. I haven't yet worked out the exact algorithm, but I imagine it would look something like this:

1. Start with the domain. If it is new, generate a random SDR for it and remember the SDR. If it has been seen before, look up the remembered SDR.

2. Look at the subdomain. Select a subset of the bits from the SDR above, and replace them with new ones to represent the subdomain. If the subdomain is new, the new bits will be randomly generated and remembered. If the subdomain has been seen before, look up the remembered bits and use those for the replacement. (The number of bits in the subset will depend on how much overlap you want between similar URLs.)

3. Repeat #2 with the next path element in the hierarchy, and continue until you have gone through the entire URL.

This way, URLs that share path elements as you move down the hierarchy will have shared bits in their SDRs. The method is also flexible enough to support any number of path elements (though if you have too many, you'll saturate the number of unique representations available for unique URLs).

Hope that makes sense, and that it works decently. You'll probably have to modify the algorithm a bit to make it work well; this is all just off the top of my head.

- Chetan

[1] https://www.youtube.com/watch?v=_q5W2Ov6C9E

On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:

> I am tinkering with the CLA a bit and want to play around with web
> browsing history data.
>
> I'm trying to determine whether it would be feasible to predict the URL,
> or at least the top-level domain that is most likely to be visited next by
> a web surfer, based on their past browsing history.
> I might go so far as to make a multi-step prediction to short-circuit a
> web surfer's navigation directly to the page they are interested in.
>
> First of all, I'm looking for feedback on whether this idea even makes
> sense as an application of the CLA, and whether anyone has tried something
> similar.
>
> Second, I'm a little bit stuck coming up with a good way to encode a URL
> for input to the SP. One thought is to break the URL into component fields
> (e.g., top-level domain, URL path and params). The problem is that the
> encoding should be adaptive and pick up values that have never been seen
> before. I'm uncertain how to approach this.
>
> Since there's no semantic similarity to be inferred between two different
> TLDs with similar names, a basic numeric encoding doesn't make sense.
>
> It might be reasonable to think that different URL paths with the same TLD
> and subdomain have some semantic similarity (e.g., maps.google.com/usa
> and maps.google.com/canada are both maps). I would also suggest that if
> two URLs share some path elements, they are even more similar. So ideally,
> I would come up with an encoding that has little or no overlap for
> different TLDs, more overlap for the same TLD and subdomain, and even more
> if they have the same TLD and subdomain and share path elements.
>
> Thoughts?
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
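P.S. The replace-and-remember scheme described above could be sketched in Python roughly as follows. Everything here is an illustrative assumption on my part, not NuPIC code: the class name, the SDR size (n=400, w=21 active bits), swapping 7 bits per path element, and the simplification of treating the last two host labels as the registered domain.

```python
import random
from urllib.parse import urlparse

class URLEncoder:
    """Hierarchical URL encoder sketch: a remembered random SDR per domain,
    with a remembered subset of bits swapped for each deeper URL element."""

    def __init__(self, n=400, w=21, replace=7, seed=42):
        self.n = n              # total bits in the SDR
        self.w = w              # number of active bits
        self.replace = replace  # bits swapped per subdomain/path element
        self.rng = random.Random(seed)
        self.domain_sdrs = {}   # domain -> set of active bit indices
        self.element_bits = {}  # (parent chain, element) -> (old bits, new bits)

    def encode(self, url):
        parsed = urlparse(url)
        host_parts = parsed.netloc.split('.')
        # Simplification: last two host labels are "the domain".
        domain = '.'.join(host_parts[-2:])
        subdomains = host_parts[:-2]
        path_elements = [p for p in parsed.path.split('/') if p]

        # Step 1: random SDR for the domain, remembered across calls.
        if domain not in self.domain_sdrs:
            self.domain_sdrs[domain] = set(
                self.rng.sample(range(self.n), self.w))
        active = set(self.domain_sdrs[domain])

        # Steps 2-3: for each deeper element, swap a remembered subset
        # of bits. The key is the full chain, so the active set reached
        # at each key is deterministic and the swap is reproducible.
        key = domain
        for element in subdomains[::-1] + path_elements:
            key = (key, element)
            if key not in self.element_bits:
                old = self.rng.sample(sorted(active), self.replace)
                new = self.rng.sample(
                    sorted(set(range(self.n)) - active), self.replace)
                self.element_bits[key] = (old, new)
            old, new = self.element_bits[key]
            active.difference_update(old)
            active.update(new)
        return active
```

With this sketch, maps.google.com/usa and maps.google.com/canada share the domain and subdomain bits and differ only in the final 7-bit swap, while an unrelated domain gets an independent random SDR with only chance-level overlap.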
