Re: [nupic-discuss] Appropriate encoder(s) for URLs?

Matthew Taylor Tue, 08 Apr 2014 09:40:45 -0700

I was thinking, if you're going to break a URL into different fields, you
might split it up into "protocol", "domain", "path", "query", and "hash":


https://www.example.com/path/to/resource.html?param1=1&param2=2#target

Would be:
protocol: https
domain: www.example.com
path: path/to/resource.html
query: param1=1&param2=2
hash: target

I'm not sure if that will help, but it is a typical URL breakout.
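
For what it's worth, here is a minimal sketch of that breakout in Python,
using only the standard library. The names split_url, urls_to_csv and FIELDS
are made up for illustration. The three header rows (field names, field
types, flags) are my assumption about what the OPF expects in a CSV file, so
please check them against the hotgym example data before relying on this.

```python
import csv
import io
from urllib.parse import urlsplit

# Field names from the breakout above (illustrative, not a NuPIC convention).
FIELDS = ["protocol", "domain", "path", "query", "hash"]


def split_url(url):
    """Break a URL into protocol, domain, path, query and hash strings."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        # netloc also carries any port or user info present in the URL.
        "domain": parts.netloc,
        # Drop the leading slash so the path matches the breakout above.
        "path": parts.path.lstrip("/"),
        "query": parts.query,
        "hash": parts.fragment,
    }


def urls_to_csv(urls):
    """Return CSV text with one string column per URL component."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)                    # row 1: field names
    writer.writerow(["string"] * len(FIELDS))  # row 2: field types (assumed)
    writer.writerow([""] * len(FIELDS))        # row 3: flags, left empty (assumed)
    for url in urls:
        row = split_url(url)
        writer.writerow([row[name] for name in FIELDS])
    return out.getvalue()


if __name__ == "__main__":
    example = "https://www.example.com/path/to/resource.html?param1=1&param2=2#target"
    print(urls_to_csv([example]))
```

Each component ends up in its own column, so each one could be treated as a
separate string field. You would probably want to normalize the values first
(lowercase the domain, drop tracking parameters from the query), since any
difference in a string produces a new representation.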

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Tue, Apr 8, 2014 at 8:35 AM, Subutai Ahmad <[email protected]> wrote:

> Hi Julie,
>
> Just to add to everyone else's input, this is a great application area for
> CLAs. I did some similar work a couple of years ago and got pretty good
> results.
>
> In terms of encoders, the simplest is to just use the OPF and use the
> "string" field type instead of float. Every new string that is encountered
> will automatically get a new random representation.  With this scheme each
> new string will be treated as a completely unique token with no semantic
> similarity to other URLs.  You'll want to make sure the string doesn't
> contain extraneous stuff since any difference will lead to a new
> representation.
>
> You could break each URL into multiple fields as you suggested. Just make
> each one a separate CSV field and give each field the string type.  I think
> this will achieve an effect that is similar to Chetan's suggestion. In my
> experiment each URL represented a news article and had a natural "topic"
> associated with it such as "business" or "politics" so I had a "topic"
> field.
>
> For best results I would recommend starting with a smaller dataset with a
> relatively small number of unique strings and then work your way up from
> there.  The amount of data you need to get good results will grow fast as
> the number of unique strings increases.  You'll probably want to swarm on
> the dataset as the parameters may need to be quite different from the
> default hotgym parameters.
>
> I'm curious to see how this goes. Please send along your results and
> questions as you make progress!
>
> --Subutai
>
>
> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>
>>  I am tinkering with the CLA a bit and want to play around with web
>> browsing history data.
>>
>> I'm trying to determine whether it would be feasible to predict the URL,
>> or at least the top-level domain that is most likely to be visited next by
>> a web surfer, based on their past browsing history. I might go so far as to
>> make a multi-step prediction to short-circuit the navigation of a web
>> surfer directly to the page they are interested in.
>>
>> First of all, I'm looking for feedback on whether this idea even makes
>> sense as an application of the CLA, and whether anyone has tried something
>> similar.
>>
>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>> for input to the SP. One thought is to break the URL into component fields
>> (e.g., top-level domain, URL path and params). The problem is that the
>> encoding should be adaptive and pick up values that have never been seen
>> before. I'm uncertain how to approach this.
>>
>> Since there's no semantic similarity to be inferred between two different
>> TLDs with similar names, a basic numeric encoding doesn't make sense.
>>
>> It might be reasonable to think that different URL paths with the same
>> TLD and subdomain have some semantic similarity (e.g.,
>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>> also suggest that if two URLs share some path elements, they are even more
>> similar. So ideally, I would come up with an encoding that has little or no
>> overlap for different TLDs, more overlap with same TLDs and subdomain, and
>> even more if they have the same TLD, subdomain and share path elements.
>>
>> Thoughts?
>>
>> _______________________________________________
>> nupic mailing list
>> [email protected]
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>
>>

