there must be several ways of representing text and streams of text in
systems like NuPIC and other reinforcement learning algorithms.  here are
a few links i found just now.

https://github.com/numenta/nupic/wiki/Natural-Language-Processing
http://people.csail.mit.edu/regina/my_papers/RL.pdf
https://code.google.com/p/word2vec/

i was looking into tools for building autonomous web browsers.  PhantomJS
is a headless WebKit engine that can be used for UI testing, web scraping,
and generating screenshots.  this project
<https://github.com/pongells/collaborative-virtual-browser> uses PhantomJS
to implement a multi-user collaborative browsing environment that proxies
client events to a PhantomJS server.  i figured this would be a good start
for an AI browser that could learn from watching human interaction.  since
it can record browsing behavior, including which websites are visited,
which links or buttons are clicked, what text is entered into forms, and
any "labels" a human might voluntarily apply to the current situation ("i
am doing ____"), it could possibly:

   - replay learned interaction sequences on command
   - predict next actions
   - characterize user activity for comparing or matching with other users

(a command shell could also be added to the AI browser so it could learn
sequences of system commands and their expected results, stored along with
the web browsing activity.)

another approach to doing something similar is implemented in SikuliX
<http://www.sikulix.com/>, but it operates by recognizing visual bitmaps.
the AI web browser would instead have direct access to the HTML DOM.

so what can we do with the massive amount of data this would generate?  a
reinforcement learning system (like OpenBECCA <http://openbecca.org>) is
defined by a vector of sensors, a vector of actions, and a reward
function.
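that sensor/action/reward loop might be sketched like this; the world and
agent here are toy stand-ins (not the actual OpenBECCA API), just to show
the shape of the interface:

```python
import random

class ToyWorld:
    """a stand-in world with 4 sensors and 2 actions (illustrative only)."""
    def reset(self):
        return [0.0] * 4

    def step(self, actions):
        # reward the agent whenever it favors action 0 over action 1
        reward = 1.0 if actions[0] > actions[1] else 0.0
        return [random.random() for _ in range(4)], reward

class RandomAgent:
    """picks random actions; a real learner would use sensors + reward."""
    def step(self, sensors, reward):
        return [random.random(), random.random()]

def run_episode(world, agent, steps=100):
    # the core RL loop: sensors + reward in, actions out, repeat
    sensors, reward = world.reset(), 0.0
    total = 0.0
    for _ in range(steps):
        actions = agent.step(sensors, reward)
        sensors, reward = world.step(actions)
        total += reward
    return total
```

the browser recorder would fill the role of ToyWorld, with the scrolling
text windows below as its sensor vector.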

one input representation i'm imagining would present the current system
state as a set of text windows, each scrolling a stream of content up to a
finite history.  this would resemble several text console displays where
old lines disappear as new input arrives.  each "text pixel" of a window
could consist of one byte of its UTF-8 encoding.  the total input seen by
the agent would consist of several of these scrolling windows representing
all aspects of the system's activity, for example: Received HTML (raw),
User Interaction Events, Page Title, Page URL.

another way to represent input or output text would be as a list of
integers, each an index into a dictionary of strings; the string is
recovered by concatenating the dictionary entries for each element.  the
dictionary could include separate symbols for URL characters like "/",
":", "?", "=", etc., so the terms they connect become unique symbols of
their own.  new terms could be added dynamically, assuming there is room
remaining.

[http][:][/][/][www][.][numenta][.][com]
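the dictionary encoding above might look like this minimal sketch (the
class name, separator set, and capacity handling are all illustrative
assumptions):

```python
import re

class TermDictionary:
    """maps terms and URL separators to integer indices; new terms are
    added dynamically while capacity remains."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.symbols = {}                 # string -> integer index
        for c in "/:?=&.":                # pre-seed URL separator symbols
            self.add(c)

    def add(self, term):
        if term not in self.symbols:
            if len(self.symbols) >= self.capacity:
                raise MemoryError("dictionary full")
            self.symbols[term] = len(self.symbols)
        return self.symbols[term]

    def encode(self, text):
        # split into separators and the terms they connect
        tokens = [t for t in re.split(r"([/:?=&.])", text) if t]
        return [self.add(t) for t in tokens]

    def decode(self, indices):
        rev = {i: s for s, i in self.symbols.items()}
        return "".join(rev[i] for i in indices)

d = TermDictionary()
ids = d.encode("http://www.numenta.com")
assert d.decode(ids) == "http://www.numenta.com"
```

this reproduces the nine-symbol split shown above:
[http][:][/][/][www][.][numenta][.][com].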

On Thu, Apr 3, 2014 at 7:43 PM, Julie Pitt <[email protected]> wrote:

> I am tinkering with the CLA a bit and want to play around with web
> browsing history data.
>
> I'm trying to determine whether it would be feasible to predict the URL,
> or at least the top-level domain that is most likely to be visited next by
> a web surfer, based on their past browsing history. I might go so far as to
> make a multi-step prediction to short-circuit the navigation of a web
> surfer, taking them directly to the page they are interested in.
>
> First of all, I'm looking for feedback on whether this idea even makes
> sense as an application of the CLA, and whether anyone has tried something
> similar.
>
> Second, I'm a little bit stuck coming up with a good way to encode a URL
> for input to the SP. One thought is to break the URL into component fields
> (e.g., top-level domain, URL path and params). The problem is that the
> encoding should be adaptive and pick up values that have never been seen
> before. I'm uncertain how to approach this.
>
> Since there's no semantic similarity to be inferred between two different
> TLDs with similar names, a basic numeric encoding doesn't make sense.
>
> It might be reasonable to think that different URL paths with the same TLD
> and subdomain have some semantic similarity (e.g., maps.google.com/usa
> and maps.google.com/canada are both maps). I would also suggest that if
> two URLs share some path elements, they are even more similar. So ideally,
> I would come up with an encoding that has little or no overlap for
> different TLDs, more overlap with same TLDs and subdomain, and even more if
> they have the same TLD, subdomain and share path elements.
>
> Thoughts?
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>
>
