Re: [nupic-discuss] Appropriate encoder(s) for URLs?

Matthew Taylor Tue, 08 Apr 2014 10:36:21 -0700

I think it depends on the variety of URLs Julie will be seeing. We can't
assume that paths are structured the same way across domains. Some sites
will be very organized; others won't have a path at all (modern
one-page-load HTML5 type sites). Either way, if the path is broken into
pieces, the number of pieces will vary from URL to URL (ex: /news/sports
vs /us/news/sports/football). That would be hard to deal with in NuPIC
because we can't change the set of fields without new model params.
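
To make the problem concrete, here is a minimal sketch of one possible
workaround, assuming we cap the path depth so that every record has the same
columns. It uses only Python 3's standard urllib.parse; url_to_fields,
MAX_PATH_FIELDS and the path0..pathN column names are hypothetical and are
not part of NuPIC.

```python
from urllib.parse import urlsplit

MAX_PATH_FIELDS = 4  # hypothetical cap on path depth, chosen for illustration


def url_to_fields(url, max_path_fields=MAX_PATH_FIELDS):
    """Split a URL into a fixed set of string fields.

    Path segments beyond max_path_fields are dropped, and missing ones are
    filled with "" so that every URL yields the same set of keys.
    """
    parts = urlsplit(url)
    # Drop empty segments produced by leading, trailing or doubled slashes.
    segments = [s for s in parts.path.split("/") if s]
    padded = (segments + [""] * max_path_fields)[:max_path_fields]

    fields = {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "query": parts.query,
        "hash": parts.fragment,
    }
    for i, segment in enumerate(padded):
        fields["path%d" % i] = segment
    return fields


if __name__ == "__main__":
    for url in ("http://example.com/news/sports",
                "http://example.com/us/news/sports/football",
                "http://example.com/"):
        print(url_to_fields(url))
```

The trade-offs are that anything deeper than the cap is thrown away, and the
empty string becomes just another token in each path field, so this may or
may not work well in practice.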


---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Tue, Apr 8, 2014 at 10:30 AM, Fergal Byrne
<[email protected]> wrote:

> Hi Matt,
>
> Great idea, but would definitely be best to split the path into separate
> fields, as the paths form a semantic hierarchy (/sport/* would all share
> meaning, /news/* would be another, with /news/local/* and /news/us/*,
> /news/eu/* etc being subhierarchies).
>
> Marek might be able to help with his VectorEncoder..?
>
> Regards,
>
> Fergal Byrne
>
>
> On Tue, Apr 8, 2014 at 5:39 PM, Matthew Taylor <[email protected]> wrote:
>
>> I was thinking, if you're going to break a URL into different fields, you
>> might break it up into "protocol", "domain", "path", "query", and "hash":
>>
>> https://www.example.com/path/to/resource.html?param1=1&param2=2#target
>>
>> Would be:
>> protocol: https
>> domain: www.example.com
>> path: path/to/resource.html
>> query: param1=1&param2=2
>> hash: target
>>
>> I'm not sure if that will help, but it is a typical URL breakout.
>>
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>>
>>> On Tue, Apr 8, 2014 at 8:35 AM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Hi Julie,
>>>
>>> Just to add to everyone else's input, this is a great application area
>>> for CLAs. I did some similar work a couple of years ago and got pretty
>>> good results.
>>>
>>> In terms of encoders, the simplest approach is to use the OPF with the
>>> "string" field type instead of float. Every new string that is encountered
>>> will automatically get a new random representation. With this scheme each
>>> new string is treated as a completely unique token with no semantic
>>> similarity to other URLs. You'll want to make sure the string doesn't
>>> contain extraneous stuff, since any difference will lead to a new
>>> representation.
>>>
>>> You could break each URL into multiple fields as you suggested. Just
>>> make each one a separate CSV field and give each field the string type. I
>>> think this will achieve an effect that is similar to Chetan's suggestion.
>>> In my experiment each URL represented a news article and had a natural
>>> "topic" associated with it such as "business" or "politics" so I had a
>>> "topic" field.
>>>
>>> For best results I would recommend starting with a smaller dataset with
>>> a relatively small number of unique strings and then working your way up
>>> from there. The amount of data you need to get good results will grow fast as
>>> the number of unique strings increases.  You'll probably want to swarm on
>>> the dataset as the parameters may need to be quite different from the
>>> default hotgym parameters.
>>>
>>> I'm curious to see how this goes. Please send along your results and
>>> questions as you make progress!
>>>
>>> --Subutai
>>>
>>>
>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>
>>>>  I am tinkering with the CLA a bit and want to play around with web
>>>> browsing history data.
>>>>
>>>> I'm trying to determine whether it would be feasible to predict the
>>>> URL, or at least the top-level domain that is most likely to be visited
>>>> next by a web surfer, based on their past browsing history. I might go so
>>>> far as to make a multi-step prediction to short-circuit the navigation of a
>>>> web surfer directly to the page they are interested in.
>>>>
>>>> First of all, I'm looking for feedback on whether this idea even makes
>>>> sense as an application of the CLA, and whether anyone has tried something
>>>> similar.
>>>>
>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>> URL for input to the SP. One thought is to break the URL into component
>>>> fields (e.g., top-level domain, URL path and params). The problem is that
>>>> the encoding should be adaptive and pick up values that have never been
>>>> seen before. I'm uncertain how to approach this.
>>>>
>>>> Since there's no semantic similarity to be inferred between two
>>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>>> sense.
>>>>
>>>> It might be reasonable to think that different URL paths with the same
>>>> TLD and subdomain have some semantic similarity (e.g.,
>>>> maps.google.com/usa and maps.google.com/canada are both maps). I would
>>>> also suggest that if two URLs share some path elements, they are even more
>>>> similar. So ideally, I would come up with an encoding that has little or no
>>>> overlap for different TLDs, more overlap for the same TLD and subdomain, and
>>>> even more if they have the same TLD, subdomain and share path elements.
>>>>
>>>> Thoughts?
>>>>
>>>> _______________________________________________
>>>> nupic mailing list
>>>> [email protected]
>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>>
>>>
>>
>
>
> --
>
> Fergal Byrne, Brenter IT
>
> Author, Real Machine Intelligence with Clortex and NuPIC
> https://leanpub.com/realsmartmachines
>
> http://inbits.com - Better Living through
> Thoughtful Technology
> http://ie.linkedin.com/in/fergbyrne/
> https://github.com/fergalbyrne
>
> e:[email protected] t:+353 83 4214179
> Formerly of Adnet [email protected] http://www.adnet.ie
>

