Re: [nupic-discuss] Appropriate encoder(s) for URLs?

Subutai Ahmad Tue, 08 Apr 2014 16:49:34 -0700

Sure, I can look that up.  I need to dig around for it.

--Subutai



On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:

> Subutai,
> do you think you could still dig up a paper or data from your previous
> experiments? It would be interesting!
> Cheers,
>
>
> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>
>> Hi Julie,
>>
>> Just to add to everyone else's input, this is a great application area
>> for CLAs. I did some similar work a couple of years ago and got pretty
>> good results.
>>
>> In terms of encoders, the simplest is to just use the OPF and use the
>> "string" field type instead of float. Every new string that is encountered
>> will automatically get a new random representation.  With this scheme each
>> new string will be treated as a completely unique token with no semantic
>> similarity to other URLs.  You'll want to make sure the string doesn't
>> contain extraneous stuff since any difference will lead to a new
>> representation.
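
A minimal sketch of the behaviour described above, where every previously unseen string gets its own random representation. This is not NuPIC's implementation: the class name RandomStringEncoder and the n_bits, n_active and seed parameters are assumptions chosen for illustration.

```python
import random


class RandomStringEncoder:
    """Hypothetical encoder: one random sparse bit set per distinct string."""

    def __init__(self, n_bits=1024, n_active=21, seed=42):
        self.n_bits = n_bits          # total width of the representation
        self.n_active = n_active      # number of active bits per string
        self._rng = random.Random(seed)
        self._table = {}              # string -> frozenset of active bit indices

    def encode(self, value):
        # A string seen before returns its stored representation;
        # a new string is assigned a fresh random one.
        if value not in self._table:
            bits = self._rng.sample(range(self.n_bits), self.n_active)
            self._table[value] = frozenset(bits)
        return self._table[value]


encoder = RandomStringEncoder()
with_slash = encoder.encode("http://example.com/")
without_slash = encoder.encode("http://example.com")
```

Here with_slash and without_slash are unrelated bit sets even though the two strings differ by a single character, which is why it pays to strip extraneous characters before encoding.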
>>
>> You could break each URL into multiple fields as you suggested. Just make
>> each one a separate CSV field and give each field the string type.  I think
>> this will achieve an effect that is similar to Chetan's suggestion. In my
>> experiment each URL represented a news article and had a natural "topic"
>> associated with it such as "business" or "politics" so I had a "topic"
>> field.
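
As a rough illustration of the multi-field approach, the sketch below splits each URL into separate string columns and writes them as CSV. The column names (domain, subdomain, path, params) and the "last two host labels" rule for the domain are assumptions, not anything prescribed in this thread; the rule is naive and will be wrong for hosts such as example.co.uk. Only a plain header row is written, so adapt the layout to whatever your OPF setup expects.

```python
import csv
import io
from urllib.parse import urlsplit

FIELDS = ["domain", "subdomain", "path", "params"]  # hypothetical column names


def url_to_fields(url):
    """Split a URL into string fields, one per CSV column."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host after a scheme
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        "domain": ".".join(labels[-2:]),     # naive: last two host labels
        "subdomain": ".".join(labels[:-2]),  # everything before the domain
        "path": parts.path,
        "params": parts.query,
    }


def urls_to_csv(urls):
    """Return CSV text with one row per URL."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for url in urls:
        writer.writerow(url_to_fields(url))
    return buffer.getvalue()


csv_text = urls_to_csv(["http://maps.google.com/usa?zoom=3",
                        "maps.google.com/canada"])
```

A "topic" column like the one mentioned above could be added the same way if the data carries such a label.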
>>
>> For best results I would recommend starting with a smaller dataset with a
>> relatively small number of unique strings and then working your way up from
>> there.  The amount of data you need to get good results will grow fast as
>> the number of unique strings increases.  You'll probably want to swarm on
>> the dataset as the parameters may need to be quite different from the
>> default hotgym parameters.
>>
>> I'm curious to see how this goes. Please send along your results and
>> questions as you make progress!
>>
>> --Subutai
>>
>>
>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>
>>>  I am tinkering with the CLA a bit and want to play around with web
>>> browsing history data.
>>>
>>> I'm trying to determine whether it would be feasible to predict the URL,
>>> or at least the top-level domain that is most likely to be visited next by
>>> a web surfer, based on their past browsing history. I might go so far as to
>>> make a multi-step prediction to short-circuit the navigation and take a web
>>> surfer directly to the page they are interested in.
>>>
>>> First of all, I'm looking for feedback on whether this idea even makes
>>> sense as an application of the CLA, and whether anyone has tried something
>>> similar.
>>>
>>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>>> for input to the SP. One thought is to break the URL into component fields
>>> (e.g., top-level domain, URL path and params). The problem is that the
>>> encoding should be adaptive and pick up values that have never been seen
>>> before. I'm uncertain how to approach this.
>>>
>>> Since there's no semantic similarity to be inferred between two
>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>> sense.
>>>
>>> It might be reasonable to think that different URL paths with the same
>>> TLD and subdomain have some semantic similarity (e.g.,
>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>> also suggest that if two URLs share some path elements, they are even more
>>> similar. So ideally, I would come up with an encoding that has little or no
>>> overlap for different TLDs, more overlap with same TLDs and subdomain, and
>>> even more if they have the same TLD, subdomain and share path elements.
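
One possible way to get the graded overlap described above is to give every hierarchical prefix of the URL (domain, then full host, then each successive path element) its own fixed set of bits and take the union. The sketch below is only an illustration under stated assumptions: N_BITS, BITS_PER_PART, the hash-seeded bit choice and the naive "last two host labels" domain rule are all made up here and are not part of NuPIC.

```python
import hashlib
import random
from urllib.parse import urlsplit

N_BITS = 2048        # assumed width of the encoding
BITS_PER_PART = 20   # assumed number of bits contributed by each prefix


def _bits_for(key):
    """Deterministically map a string to a fixed set of bit indices."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return set(random.Random(seed).sample(range(N_BITS), BITS_PER_PART))


def url_prefixes(url):
    """List the hierarchical prefixes: domain, host, host/path1, ..."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host after a scheme
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain = ".".join(host.split(".")[-2:])  # naive: last two host labels
    prefixes = [domain]
    if host != domain:
        prefixes.append(host)
    current = host
    for element in parts.path.split("/"):
        if element:
            current = current + "/" + element
            prefixes.append(current)
    return prefixes


def encode_url(url):
    """Union of the bit sets of all prefixes of the URL."""
    active = set()
    for prefix in url_prefixes(url):
        active |= _bits_for(prefix)
    return active


def overlap(url_a, url_b):
    """Number of active bits the two encodings share."""
    return len(encode_url(url_a) & encode_url(url_b))
```

With this scheme, maps.google.com/usa and maps.google.com/canada share the bits of their common domain and host, while URLs on unrelated domains share only whatever bits happen to collide by chance. New values need no special handling, because the bits for a prefix are derived from the string itself.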
>>>
>>> Thoughts?
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>
>>>
>>
>
>
> --
> Marek Otahal :o)
>
