That's a very nice idea!

The challenging part would be, like you said, representing the URL:
parsing out the right parts, and, even harder, reconstructing a URL from a
prediction.

Things like JavaScript links and PHP query-string mess in URLs
(domain.com/dir/page.php?var=value&bla=ble) would be your enemies.

I suggest simplifying the problem (still cool applications) and focusing
only on plain (HTML) URLs: sub.domain.com/dir/page.html
A URL would then be encoded as a MultiEncoder of 4 parts, where each is a
CategoryEncoder: {sub, domain, dir, page}
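
The parsing half of that could look roughly like this (a plain-Python
sketch that assumes the simplified sub.domain.com/dir/page.html shape and
ignores ports, query strings, deeper paths, etc.; the four fields would
then each feed a CategoryEncoder inside a MultiEncoder):

```python
from urllib.parse import urlparse

def url_to_fields(url):
    """Split a URL into the four categorical fields {sub, domain, dir, page}.

    Rough sketch: assumes a "sub.domain.tld/dir/page.html" shape and
    ignores query strings, ports, and schemes other than http(s).
    """
    # urlparse needs a "//" to recognize the host part
    parsed = urlparse(url if "//" in url else "//" + url)
    host_parts = parsed.netloc.split(".")
    sub = host_parts[0] if len(host_parts) > 2 else ""
    domain = ".".join(host_parts[-2:])
    path_parts = [p for p in parsed.path.split("/") if p]
    page = path_parts[-1] if path_parts else ""
    directory = "/".join(path_parts[:-1])
    return {"sub": sub, "domain": domain, "dir": directory, "page": page}
```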

Another idea would be to predict not just sequences of URLs, but to take
time/day information into account, e.g.:
morning - news sites
afternoon 4pm (after work) - facebook
evening - ...
Friday - travel sites


That way you would not have to worry about parsing URLs at all: either take
only domain.com into account, or take the whole URL as a string and encode
it as a CategoryEncoder.
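
Reducing a visit timestamp to coarse categories could be as simple as this
sketch (the bucket boundaries are arbitrary picks; the two resulting
strings plus the domain would each go to a CategoryEncoder):

```python
from datetime import datetime

def visit_context(ts):
    """Reduce a visit timestamp to coarse (daypart, weekday) categories.

    Sketch only: bucket boundaries are arbitrary choices.
    """
    hour = ts.hour
    if 5 <= hour < 12:
        daypart = "morning"
    elif 12 <= hour < 18:
        daypart = "afternoon"
    else:
        daypart = "evening"
    weekday = ts.strftime("%A").lower()  # e.g. "friday"
    return daypart, weekday
```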


Cheers, and let us know your results! ;)



On Fri, Apr 4, 2014 at 2:42 AM, Chetan Surpur <[email protected]> wrote:

> Julie,
>
> I think this web browsing prediction is a great project idea. It's
> inherently spatial and temporal in its patterns. I'd love to see what you
> can come up with, and what kinds of predictions can be made by the CLA over
> large datasets.
>
> Out of curiosity, what datasets are you planning to train on?
>
> Regarding a URL encoder, my gut instinct is that an encoder that operates
> in a way similar to the Random Distributed Scalar Encoder might work well.
> (For details on the RDSE, see this presentation: [1]). We'll have to adapt
> it to work with URLs as you described.
>
> I haven't yet worked out the exact algorithm, but I imagine it to look
> something like this:
>
> 1. Start with the domain. If it is new, generate a random SDR for it, and
> remember the SDR. If it has been seen before, look up the remembered SDR.
> 2. Look at the subdomain. Select a subset of the bits from the SDR above,
> and replace them with new ones to represent the subdomain. If the subdomain
> is new, the new bits will be randomly generated and remembered. If the
> subdomain has been seen before, look up the remembered bits and use those for
> the replacement. (The number of bits in the subset will depend on how much
> overlap you want between similar URLs.)
> 3. Repeat #2 using the next path element in the hierarchy.
>
> Follow this process until you have gone through the entire URL. This way,
> URLs that share path elements as you move down the hierarchy will have
> shared bits in the SDR. Also, this method is flexible to support any number
> of path elements (but if you have too many, you'll saturate the number of
> unique representations available for unique URLs).
>
> Hope that makes sense, and that it works decently. You probably have to
> modify the algorithm a bit to make it work well; this is all just off the
> top of my head.
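
[Inline] A rough sketch of the hierarchical replacement scheme above, in
plain Python (n, w, and the per-level replacement count are arbitrary
picks here, and the "remembered" bits are just a dict):

```python
import random

class HierarchicalURLEncoder:
    """Sketch of a hierarchical random-SDR URL encoder.

    n:       total SDR bits
    w:       number of active bits
    replace: how many active bits each deeper path element overwrites
             (controls overlap between similar URLs)
    """

    def __init__(self, n=1024, w=40, replace=10, seed=42):
        self.n, self.w, self.replace = n, w, replace
        self.rng = random.Random(seed)
        self.memory = {}  # remembered bits per (level, token)

    def _bits_for(self, level, token, k):
        key = (level, token)
        if key not in self.memory:  # new token: generate and remember
            self.memory[key] = self.rng.sample(range(self.n), k)
        return self.memory[key]

    def encode(self, parts):
        """parts: [domain, subdomain, path1, ...] -> sorted active bits."""
        active = list(self._bits_for(0, parts[0], self.w))
        for level, token in enumerate(parts[1:], start=1):
            # replace a subset of the current active bits with the
            # remembered bits for this path element
            new_bits = self._bits_for(level, token, self.replace)
            active = active[self.replace:] + new_bits
        return sorted(set(active))
```

URLs sharing a prefix of the hierarchy then share most of their bits,
while unrelated domains overlap only by chance.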
>
> - Chetan
>
> [1] https://www.youtube.com/watch?v=_q5W2Ov6C9E
>
>
> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>
>> I am tinkering with the CLA a bit and want to play around with web
>> browsing history data.
>>
>> I'm trying to determine whether it would be feasible to predict the URL,
>> or at least the top-level domain that is most likely to be visited next by
>> a web surfer, based on their past browsing history. I might go so far as to
>> make a multi-step prediction that short-circuits a web surfer's navigation
>> directly to the page they are interested in.
>>
>> First of all, I'm looking for feedback on whether this idea even makes
>> sense as an application of the CLA, and whether anyone has tried something
>> similar.
>>
>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>> for input to the SP. One thought is to break the URL into component fields
>> (e.g., top-level domain, URL path and params). The problem is that the
>> encoding should be adaptive and pick up values that have never been seen
>> before. I'm uncertain how to approach this.
>>
>> Since there's no semantic similarity to be inferred between two different
>> TLDs with similar names, a basic numeric encoding doesn't make sense.
>>
>> It might be reasonable to think that different URL paths with the same
>> TLD and subdomain have some semantic similarity (e.g.,
>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>> also suggest that if two URLs share some path elements, they are even more
>> similar. So ideally, I would come up with an encoding that has little or no
>> overlap for different TLDs, more overlap with same TLDs and subdomain, and
>> even more if they have the same TLD, subdomain and share path elements.
>>
>> Thoughts?
>>
>> _______________________________________________
>> nupic mailing list
>> [email protected]
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>
>>
>
>


-- 
Marek Otahal :o)
