That's it! Thank you.

On Thu, Apr 10, 2014 at 3:28 PM, Matthew Taylor <[email protected]> wrote:

> I think this is what you're looking for:
> https://github.com/subutai/nupic.subutai/tree/master/swarm_examples
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Thu, Apr 10, 2014 at 3:14 PM, Julie Pitt <[email protected]> wrote:
>
>> OK, I think I have my data set up to swarm over. I ended up creating
>> different fields for tld, domain, port, subdomain(s) and up to 6 path
>> elements (not sure yet if that's too much). At some point I thought someone
>> posted either a YouTube video or a link to GitHub with an example swarm
>> that uses multiple fields in the SDR and creates a model that predicts
>> those fields. If anyone is aware of such an example, I'd appreciate a
>> pointer!
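As a side note, here is a minimal sketch of splitting a URL into flat CSV fields like those described above; the field names and the 6-element path cap are just illustrative choices, not taken from any NuPIC example:

```python
from urllib.parse import urlparse

def url_to_fields(url, max_path_elements=6):
    """Split a URL into flat CSV-ready fields:
    [tld, domain, subdomain, port, path1..pathN]."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".")
    tld = labels[-1] if len(labels) > 1 else ""
    domain = labels[-2] if len(labels) > 1 else host
    subdomain = ".".join(labels[:-2])
    port = str(parsed.port) if parsed.port else ""
    path = [p for p in parsed.path.split("/") if p][:max_path_elements]
    # Pad so every row has the same number of columns.
    path += [""] * (max_path_elements - len(path))
    return [tld, domain, subdomain, port] + path
```

For example, `url_to_fields("http://maps.google.com/usa")` yields `["com", "google", "maps", "", "usa", "", "", "", "", ""]`.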
>>
>>
>> On Wed, Apr 9, 2014 at 11:30 AM, Julie Pitt <[email protected]> wrote:
>>
>>> Thanks all for your very helpful suggestions. I'm working on this when I
>>> have scraps of time, so apologies for my delayed responses. Subutai/Matt, I
>>> was thinking the same thing in terms of parsing a URL into its main
>>> components and encoding each one separately. And yes, there is the problem
>>> of not knowing how many URL path elements there will be. Perhaps a first
>>> cut is to just arbitrarily limit it and throw away the rest. I would also
>>> throw away URL params initially.
>>>
>>> Using a random encoder, there would obviously be no real semantic
>>> understanding of what these URLs mean, but I'm wondering if some level of
>>> understanding could be achieved by using multiple sensory regions for
>>> different parts of the URL and then forming a 2-level hierarchy to identify
>>> and predict sequences. If I can get anything interesting out of pure URL
>>> data, I would want to add temporal data to see if any predictions could be
>>> made in terms of a user's behavior throughout the day, week and year
>>> (e.g., it's holiday season, so you'll probably be cruising Amazon).
>>>
>>> Back to Chetan's question about where I'm getting my data, I am
>>> harvesting my own browsing history from Chrome's sqlite DB. So, this is
>>> really a toy. I may conscript a few friends to share with me their history.
>>> Any volunteers? :-)
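For anyone who wants to try the same thing, a rough sketch of pulling that history out. The `urls` table with `url` and `last_visit_time` columns is what I've seen in Chrome's History SQLite file, but the schema isn't guaranteed across Chrome versions; also copy the file first, since Chrome locks it while running:

```python
import sqlite3

def read_history(db_path):
    """Return (last_visit_time, url) rows in visit order.

    Note: Chrome stores last_visit_time as microseconds since
    1601-01-01, not as a Unix timestamp.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT last_visit_time, url FROM urls ORDER BY last_visit_time"
        ).fetchall()
    finally:
        conn.close()
```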
>>>
>>> As far as training goes, I'm thinking there's value in pooling history
>>> from many people on which to learn "default" behaviors, and then keeping
>>> learning turned on to get a feel for how a particular user behaves. The
>>> problem is how to feed the data in, because at that point you get
>>> interleaved time steps that are unrelated. There probably needs to be a
>>> concept of a user in there.
>>>
>>>
>>> On Tue, Apr 8, 2014 at 4:48 PM, Subutai Ahmad <[email protected]> wrote:
>>>
>>>>
>>>> Sure, I can look that up.  I need to dig around for it.
>>>>
>>>> --Subutai
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 12:52 PM, Marek Otahal <[email protected]> wrote:
>>>>
>>>>> Subutai,
>>>>> do you think you'd still dig up some paper or data from your prev.
>>>>> experiments? Would be interesting!
>>>>> Cheers,
>>>>>
>>>>>
>>>>> On Tue, Apr 8, 2014 at 5:35 PM, Subutai Ahmad <[email protected]> wrote:
>>>>>
>>>>>> Hi Julie,
>>>>>>
>>>>>> Just to add to everyone else's input, this is a great application
>>>>>> area for CLAs. I did some similar work a couple of years ago and got
>>>>>> pretty good results.
>>>>>>
>>>>>> In terms of encoders, the simplest is to use the OPF with the
>>>>>> "string" field type instead of float. Every new string that is
>>>>>> encountered will automatically get a new random representation. With
>>>>>> this scheme, each new string will be treated as a completely unique
>>>>>> token with no semantic similarity to other URLs. You'll want to make
>>>>>> sure the string doesn't contain extraneous content, since any
>>>>>> difference will lead to a new representation.
>>>>>>
>>>>>> You could break each URL into multiple fields as you suggested. Just
>>>>>> make each one a separate CSV field and each field into a string type.  I
>>>>>> think this will achieve an effect that is similar to Chetan's suggestion.
>>>>>> In my experiment each URL represented a news article and had a natural
>>>>>> "topic" associated with it such as "business" or "politics" so I had a
>>>>>> "topic" field.
>>>>>>
>>>>>> For best results I would recommend starting with a smaller dataset
>>>>>> with a relatively small number of unique strings and then working
>>>>>> your way up from there. The amount of data you need to get good
>>>>>> results will grow fast as the number of unique strings increases.
>>>>>> You'll probably want to swarm on the dataset, as the parameters may
>>>>>> need to be quite different from the default hotgym parameters.
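As a sketch of what swarming over such fields could look like, here is a hypothetical swarm description; the file paths and field names are placeholders, and the keys follow NuPIC's swarm description format as I understand it, so double-check against the swarming docs:

```json
{
  "includedFields": [
    {"fieldName": "tld", "fieldType": "string"},
    {"fieldName": "domain", "fieldType": "string"},
    {"fieldName": "path1", "fieldType": "string"}
  ],
  "streamDef": {
    "info": "browsing history",
    "version": 1,
    "streams": [
      {"info": "history.csv", "source": "file://history.csv", "columns": ["*"]}
    ]
  },
  "inferenceType": "TemporalMultiStep",
  "inferenceArgs": {"predictionSteps": [1], "predictedField": "domain"},
  "swarmSize": "medium"
}
```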
>>>>>>
>>>>>> I'm curious to see how this goes. Please send along your results and
>>>>>> questions as you make progress!
>>>>>>
>>>>>> --Subutai
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 3, 2014 at 4:43 PM, Julie Pitt <[email protected]> wrote:
>>>>>>
>>>>>>> I am tinkering with the CLA a bit and want to play around with web
>>>>>>> browsing history data.
>>>>>>>
>>>>>>> I'm trying to determine whether it would be feasible to predict
>>>>>>> the URL, or at least the top-level domain, most likely to be
>>>>>>> visited next by a web surfer, based on their past browsing history.
>>>>>>> I might go so far as to make a multi-step prediction to
>>>>>>> short-circuit a web surfer's navigation, taking them directly to
>>>>>>> the page they are interested in.
>>>>>>>
>>>>>>> First of all, I'm looking for feedback on whether this idea even
>>>>>>> makes sense as an application of the CLA, and whether anyone has tried
>>>>>>> something similar.
>>>>>>>
>>>>>>> Second, I'm a little bit stuck coming up with a good way to encode a
>>>>>>> URL for input to the SP. One thought is to break the URL into component
>>>>>>> fields (e.g., top-level domain, URL path and params). The problem is 
>>>>>>> that
>>>>>>> the encoding should be adaptive and pick up values that have never been
>>>>>>> seen before. I'm uncertain how to approach this.
>>>>>>>
>>>>>>> Since there's no semantic similarity to be inferred between two
>>>>>>> different TLDs with similar names, a basic numeric encoding doesn't make
>>>>>>> sense.
>>>>>>>
>>>>>>> It might be reasonable to think that different URL paths with the
>>>>>>> same TLD and subdomain have some semantic similarity (e.g.,
>>>>>>> maps.google.com/usa and maps.google.com/canada are both maps). I
>>>>>>> would also suggest that if two URLs share some path elements, they
>>>>>>> are even more similar. So ideally, I would come up with an encoding
>>>>>>> that has little or no overlap for different TLDs, more overlap for
>>>>>>> the same TLD and subdomain, and even more if they share the same
>>>>>>> TLD, subdomain, and path elements.
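One way to get that graded overlap, sketched here as a hand-rolled union-of-hashes encoder rather than anything in NuPIC itself: give each URL component its own deterministic pseudo-random bit set and take the union, so two URLs overlap exactly in the bits of the components they share. All the sizes below are arbitrary placeholders:

```python
import hashlib

N = 2048            # total bits in the encoding
BITS_PER_PART = 13  # active bits contributed per URL component

def part_bits(part, n=N, w=BITS_PER_PART):
    """Deterministic pseudo-random set of w bit positions for one component."""
    bits = set()
    i = 0
    while len(bits) < w:
        h = hashlib.md5(("%s:%d" % (part, i)).encode()).hexdigest()
        bits.add(int(h, 16) % n)
        i += 1
    return bits

def encode_url(domain, subdomain, path_elements):
    """Union of per-component bit sets: two URLs overlap exactly in
    the bits of whichever components they share."""
    active = part_bits("dom:" + domain) | part_bits("sub:" + subdomain)
    for p in path_elements:
        active |= part_bits("path:" + p)
    return active

a = encode_url("google.com", "maps", ["usa"])
b = encode_url("google.com", "maps", ["canada"])
c = encode_url("amazon.com", "www", ["gp"])
# a and b share the domain and subdomain bits; a and c overlap only by chance.
```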
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> nupic mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Marek Otahal :o)
>>>>>
>>>>
>>>
>>
>