[nupic-discuss] Appropriate encoder(s) for URLs?

Julie Pitt Thu, 03 Apr 2014 16:44:16 -0700

I am tinkering with the CLA a bit and want to play around with web browsing
history data.


I'm trying to determine whether it would be feasible to predict the URL, or
at least the top-level domain, that a web surfer is most likely to visit
next, based on their past browsing history. I might even go so far as to
make a multi-step prediction that short-circuits navigation and takes the
surfer directly to the page they are interested in.

First of all, I'm looking for feedback on whether this idea even makes
sense as an application of the CLA, and whether anyone has tried something
similar.

Second, I'm a little bit stuck coming up with a good way to encode a URL
for input to the SP. One thought is to break the URL into component fields
(e.g., top-level domain, URL path and params). The problem is that the
encoding should be adaptive and pick up values that have never been seen
before. I'm uncertain how to approach this.
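
As a starting point for the field-splitting idea, here is a minimal sketch
using Python's standard urllib.parse. The function name split_url and the
field names are hypothetical, chosen only for illustration. Note that the
strict top-level domain is just the last host label (e.g., com), so the
sketch keeps all host labels, most general first, so that the site and the
subdomain remain available as separate pieces.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Break a URL into host labels, path segments and parameter names.

    Hypothetical helper: the field names are illustrative only.
    """
    # urlsplit only recognises the host when a scheme or a leading '//' is
    # present, so add '//' to bare inputs such as 'maps.google.com/usa'.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        # Most general label first, e.g. ['com', 'google', 'maps'].
        "host_labels": labels[::-1],
        # Non-empty path segments in order, e.g. ['usa'].
        "path": [seg for seg in parts.path.split("/") if seg],
        # Parameter names only; the values are dropped in this sketch.
        "params": sorted(k for k, _ in parse_qsl(parts.query,
                                                 keep_blank_values=True)),
    }
```

Because nothing here depends on a fixed vocabulary, a host or path that has
never been seen before is split the same way as a familiar one.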

Since there's no semantic similarity to be inferred between two different
TLDs with similar names, a basic numeric encoding, which gives nearby values
overlapping representations, doesn't make sense.

It might be reasonable to think that different URL paths with the same TLD
and subdomain have some semantic similarity (e.g., maps.google.com/usa and
maps.google.com/canada are both maps). I would also suggest that if two
URLs share some path elements, they are even more similar. So ideally, I
would come up with an encoding that has little or no overlap for different
TLDs, more overlap for the same TLD and subdomain, and even more when the
URLs also share path elements.
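
One way to get this kind of graded overlap is to hash each level of the URL
hierarchy (site, full host, then each successively longer path prefix) to a
small set of bit positions and take the union. The sketch below is only an
illustration under stated assumptions: the names N_BITS, BITS_PER_LEVEL,
url_levels and encode_url are hypothetical, the "site" is naively taken to
be the last two host labels (which is wrong for suffixes like co.uk), and
the result is a plain set of active bit indices, not an actual NuPIC
encoder.

```python
import hashlib
from urllib.parse import urlsplit

N_BITS = 2048        # total width of the encoding (assumed value)
BITS_PER_LEVEL = 10  # active bits contributed by each level (assumed value)


def bits_for(key, n_bits=N_BITS, w=BITS_PER_LEVEL):
    """Deterministically map a string to w distinct bit indices."""
    bits = set()
    counter = 0
    while len(bits) < w:
        digest = hashlib.sha256(f"{key}#{counter}".encode("utf-8")).digest()
        bits.add(int.from_bytes(digest[:8], "big") % n_bits)
        counter += 1
    return bits


def url_levels(url):
    """List the hierarchy levels of a URL, from most general to most specific."""
    if "://" not in url:
        url = "//" + url  # let urlsplit recognise the host of a bare URL
    parts = urlsplit(url)
    host = parts.hostname or ""
    site = ".".join(host.split(".")[-2:])  # naive: last two labels
    levels = [site]
    if host != site:
        levels.append(host)
    prefix = host
    for seg in parts.path.split("/"):
        if seg:
            prefix = prefix + "/" + seg
            levels.append(prefix)
    return levels


def encode_url(url):
    """Union of the bits of every level; shared levels yield shared bits."""
    active = set()
    for level in url_levels(url):
        active |= bits_for(level)
    return active


def overlap(url_a, url_b):
    """Number of active bits two URL encodings have in common."""
    return len(encode_url(url_a) & encode_url(url_b))
```

With this scheme, maps.google.com/usa and maps.google.com/canada share all
the bits of the google.com and maps.google.com levels, while URLs on
unrelated sites overlap only through chance hash collisions. Since the bits
come from a hash rather than a lookup table, values never seen before are
encoded without any retraining of the encoder itself.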

Thoughts?

