Re: [nupic-discuss] Appropriate encoder(s) for URLs?

Fergal Byrne Tue, 08 Apr 2014 10:31:38 -0700

Hi Matt,

Great idea, but it would definitely be best to split the path into separate
fields, as the paths form a semantic hierarchy (/sport/* would all share
meaning, /news/* would be another, with /news/local/*, /news/us/*,
/news/eu/* etc. being sub-hierarchies).
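
A rough sketch of that split, using only Python's standard urllib.parse
(nothing here is NuPIC code; the function name url_fields, the field names
path1..pathN and the max_depth cut-off are made up for illustration). Each
path level becomes its own field, so /news/local/* and /news/us/* share the
same path1 value:

```python
from urllib.parse import urlsplit


def url_fields(url, max_depth=3):
    """Split a URL into flat fields, one per path level (hypothetical layout)."""
    parts = urlsplit(url)
    # Drop empty segments caused by leading, trailing, or doubled slashes.
    segments = [s for s in parts.path.split("/") if s]
    fields = {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "query": parts.query,
        "hash": parts.fragment,
    }
    # Pad with empty strings so every record has the same columns;
    # segments deeper than max_depth are ignored in this sketch.
    for depth in range(max_depth):
        name = "path%d" % (depth + 1)
        fields[name] = segments[depth] if depth < len(segments) else ""
    return fields


if __name__ == "__main__":
    for url in ("https://www.example.com/news/local/story.html?id=7#top",
                "https://www.example.com/news/us/other.html",
                "https://www.example.com/sport/"):
        print(url_fields(url))
```

Whether a fixed number of path fields or a variable-depth scheme works
better would have to be tried on real data.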


Marek might be able to help with his VectorEncoder..?

Regards,

Fergal Byrne


On Tue, Apr 8, 2014 at 5:39 PM, Matthew Taylor <[email protected]> wrote:

> I was thinking, if you're going to break a URL into different fields, you
> might split it up into "protocol", "domain", "path", "query", and "hash":
>
> https://www.example.com/path/to/resource.html?param1=1&param2=2#target
>
> Would be:
> protocol: https
> domain: www.example.com
> path: path/to/resource.html
> query: param1=1&param2=2
> hash: target
>
> I'm not sure if that will help, but it is a typical URL breakout.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Tue, Apr 8, 2014 at 8:35 AM, Subutai Ahmad <[email protected]> wrote:
>
>> Hi Julie,
>>
>> Just to add to everyone else's input, this is a great application area
>> for CLAs. I did some similar work a couple of years ago and got pretty
>> good results.
>>
>> In terms of encoders, the simplest is to just use the OPF and use the
>> "string" field type instead of float. Every new string that is encountered
>> will automatically get a new random representation.  With this scheme each
>> new string will be treated as a completely unique token with no semantic
>> similarity to other URLs.  You'll want to make sure the string doesn't
>> contain extraneous stuff since any difference will lead to a new
>> representation.
>>
>> You could break each URL into multiple fields as you suggested. Just make
>> each one a separate CSV field and give each field the string type.  I think
>> this will achieve an effect that is similar to Chetan's suggestion. In my
>> experiment each URL represented a news article and had a natural "topic"
>> associated with it such as "business" or "politics" so I had a "topic"
>> field.
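
A minimal sketch of writing such a file with Python's csv module. It assumes
the three-row header used by NuPIC's example CSV files (field names, field
types, special flags); that layout should be checked against the NuPIC
documentation. The field names in FIELDS and the function name write_records
are hypothetical:

```python
import csv
import io

# Hypothetical field names; every column is declared as a string field.
FIELDS = ["domain", "path1", "path2", "topic"]


def write_records(rows, out):
    """Write rows (dicts keyed by FIELDS) below a three-row header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)                    # row 1: field names
    writer.writerow(["string"] * len(FIELDS))  # row 2: field types (assumed layout)
    writer.writerow([""] * len(FIELDS))        # row 3: special flags, none set here
    for row in rows:
        # Missing values become empty strings so every line has all columns.
        writer.writerow([row.get(name, "") for name in FIELDS])


if __name__ == "__main__":
    buf = io.StringIO()
    write_records([
        {"domain": "www.example.com", "path1": "news", "path2": "local",
         "topic": "politics"},
        {"domain": "www.example.com", "path1": "sport", "path2": "",
         "topic": "sport"},
    ], buf)
    print(buf.getvalue())
```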
>>
>> For best results I would recommend starting with a smaller dataset with a
>> relatively small number of unique strings and then working your way up from
>> there.  The amount of data you need to get good results will grow fast as
>> the number of unique strings increases.  You'll probably want to swarm on
>> the dataset as the parameters may need to be quite different from the
>> default hotgym parameters.
>>
>> I'm curious to see how this goes. Please send along your results and
>> questions as you make progress!
>>
>> --Subutai
>>
>>
>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>
>>>  I am tinkering with the CLA a bit and want to play around with web
>>> browsing history data.
>>>
>>> I'm trying to determine whether it would be feasible to predict the URL,
>>> or at least the top-level domain that is most likely to be visited next by
>>> a web surfer, based on their past browsing history. I might go so far as to
>>> make a multi-step prediction to short-circuit navigation and take a web
>>> surfer directly to the page they are interested in.
>>>
>>> First of all, I'm looking for feedback on whether this idea even makes
>>> sense as an application of the CLA, and whether anyone has tried something
>>> similar.
>>>
>>> Second, I'm a little bit stuck coming up with a good way to encode a URL
>>> for input to the SP. One thought is to break the URL into component fields
>>> (e.g., top-level domain, URL path and params). The problem is that the
>>> encoding should be adaptive and pick up values that have never been seen
>>> before. I'm uncertain how to approach this.
>>>
>>> Since there's no semantic similarity to be inferred between two
>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>> sense.
>>>
>>> It might be reasonable to think that different URL paths with the same
>>> TLD and subdomain have some semantic similarity (e.g.,
>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>> also suggest that if two URLs share some path elements, they are even more
>>> similar. So ideally, I would come up with an encoding that has little or no
>>> overlap for different TLDs, more overlap for the same TLD and subdomain, and
>>> even more if they have the same TLD, subdomain and share path elements.
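
One way to picture that overlap requirement, as a toy sketch and not an
actual NuPIC encoder: give every prefix of the URL (the domain, the domain
plus the first path element, and so on) its own small set of pseudo-random
bits, and take the union. Two URLs then share bits for exactly the prefixes
they have in common, apart from a few chance collisions. N_BITS,
BITS_PER_LEVEL and the function names are arbitrary choices for illustration:

```python
import hashlib

N_BITS = 2048        # width of the toy representation (arbitrary)
BITS_PER_LEVEL = 20  # active bits contributed by each prefix (arbitrary)


def bits_for(token):
    """Deterministically map a string to a set of BITS_PER_LEVEL bit indices."""
    bits = set()
    counter = 0
    while len(bits) < BITS_PER_LEVEL:
        data = ("%s#%d" % (token, counter)).encode("utf-8")
        digest = hashlib.sha256(data).digest()
        bits.add(int.from_bytes(digest[:4], "big") % N_BITS)
        counter += 1
    return bits


def encode(domain, path):
    """Union of the bit sets for the domain and each cumulative path prefix."""
    prefix = domain
    active = set(bits_for(prefix))
    for segment in (s for s in path.split("/") if s):
        prefix = prefix + "/" + segment
        active |= bits_for(prefix)
    return active


def overlap(a, b):
    """Number of active bits two encodings have in common."""
    return len(a & b)


if __name__ == "__main__":
    a = encode("maps.google.com", "usa/california")
    b = encode("maps.google.com", "usa/texas")
    c = encode("maps.google.com", "canada")
    d = encode("www.example.org", "usa/california")
    print("same domain, shared path element:", overlap(a, b))
    print("same domain only:", overlap(a, c))
    print("different domain:", overlap(a, d))
```

This scheme treats the whole domain as a single token; splitting it into TLD
and subdomain levels would follow the same pattern.
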
>>>
>>> Thoughts?
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>
>>>


-- 

Fergal Byrne, Brenter IT

Author, Real Machine Intelligence with Clortex and NuPIC
https://leanpub.com/realsmartmachines

<http://www.examsupport.ie>http://inbits.com - Better Living through
Thoughtful Technology
http://ie.linkedin.com/in/fergbyrne/
https://github.com/fergalbyrne

e:[email protected] t:+353 83 4214179
Formerly of Adnet [email protected] http://www.adnet.ie

