Hello Subutai,

Don't worry about that, your thoughtful answers are well worth the wait.

1) Creating a data folder in the same directory seems simple enough, I will
try to do it when I get back home.

2) I've triple checked everything, tried to redo it from scratch, but I
always get the same error. The only way I managed to make it work was to
rename the predicted field to "consumption" (just like the hotgym example),
but then even though the search worked, the resulting MAPE was zero, so
something clearly went wrong.

I will try to do (1) and (2) again, and if it still doesn't work, I will
post my code here.

Anyways, I've been experimenting with the half-hourly data. It is much
slower so I only have results for the 1488-step prediction (1488 = 31d *
48hh), and it actually improved the best results so far from 3.3% to 3.2%.
It is still far from the best result of the competition (2%), so I want to
try some new strategies. I will try to reduce the data size by resampling it
hourly and by excluding summertime data, and then I will try to do
48,...,1488-steps predictions, which will take a long time.
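A rough sketch of that resampling step, assuming pandas; the synthetic series here is just a stand-in for the real half-hourly load data, and the excluded months (June-August) are one possible reading of "summertime":

```python
import pandas as pd

# Synthetic stand-in for one year of half-hourly load readings.
idx = pd.date_range("2013-01-01", periods=48 * 365, freq="30min")
load = pd.Series(range(len(idx)), index=idx, dtype=float, name="load")

# Downsample half-hourly readings to hourly means: 48 -> 24 points per day.
hourly = load.resample("60min").mean()

# Exclude summer months, since the prediction target is January.
hourly = hourly[~hourly.index.month.isin([6, 7, 8])]
```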

Do you know how "pamLength" should relate to the sequence size? I set it
manually to 48, since that is a whole day, but I don't know if this makes
sense. Is there any other parameter I should change to improve long-term
learning?

Thanks,
Pedro.


On Thu, Oct 10, 2013 at 3:10 PM, Subutai Ahmad <[email protected]> wrote:

> Hi Pedro,
>
> Apologies for the delayed response. I was pretty swamped with personal and
> Grok stuff. I'm not as familiar with the details of the swarming code,
> but here's what I've found:
>
> 1) The code uses NuPIC's findDataset command to locate the file. One way
> to add a new data file is to add the directory to NTA_DATA_PATH. If you
> don't want to mess with environment variables you can also add a
> subdirectory called "data" in the same location as your search_def.json.
>  In both these cases you can specify the filename as:
>
> "file://small_test.csv"
>
> There seems to be a bug in the swarming logic where you can't specify
> absolute paths (findDataset works fine with absolute paths, so the bug is
> somewhere else).
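The two options above can be sketched in shell form; the `~/tmp` paths are just examples, and the `touch` is a stand-in for copying the real CSV:

```shell
# Option A: add the data directory to NTA_DATA_PATH.
export NTA_DATA_PATH="$HOME/tmp/data"

# Option B: put the CSV in a "data" subdirectory next to search_def.json.
mkdir -p "$HOME/tmp/data"
touch "$HOME/tmp/data/small_test.csv"   # stand-in for the real CSV

# In both cases the stream source in search_def.json stays relative:
#   "source": "file://small_test.csv"
```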
>
> 2) Field names are specified in a couple of places. I created a new small
> test file with different field names. The first few lines of this file are:
>
> dttm,value
> datetime,float
> T,
> 2013-07-09 12:05:00.0,117.0666667
> 2013-07-09 12:20:00.0,118.6666667
> 2013-07-09 12:35:00.0,120.0666667
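For context, the file above follows the three-row header convention NuPIC expects: field names, field types, then a flags row where "T" marks the timestamp field. A small sketch of generating such a file (the rows mirror the sample above):

```python
import csv
import io

# NuPIC-style CSV: three header rows, then the data rows.
rows = [
    ["dttm", "value"],                         # row 1: field names
    ["datetime", "float"],                     # row 2: field types
    ["T", ""],                                 # row 3: flags ("T" = timestamp)
    ["2013-07-09 12:05:00.0", "117.0666667"],  # data rows follow
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```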
>
> I put a new search_def.json in a separate directory under ~/tmp and the
> above file in a subdirectory called "data".  I then used the
> search_def.json below and it ran fine.  I put my changes in red so you can
> see what's going on.  I also updated the "Running Swarms" wiki page with
> this information.
>
> I don't think I answered your latest question. I'm not sure what is going
> on there but you may want to double check the field names are consistent
> everywhere. If you're still having problems can you email your
> search_def.json and the first few lines of the file?
>
>
> --Subutai
>
> {
>   "includedFields": [
>     {
>       "fieldName": "dttm",
>       "fieldType": "datetime"
>     },
>     {
>       "fieldName": "value",
>       "fieldType": "float"
>     }
>   ],
>   "streamDef": {
>     "info": "test",
>     "version": 1,
>     "streams": [
>       {
>         "info": "hotGym.csv",
>         "source": "file://small_test.csv",
>         "columns": [
>           "*"
>         ],
>         "last_record": 100
>       }
>     ],
>     "aggregation": {
>       "hours": 1,
>       "microseconds": 0,
>       "seconds": 0,
>       "fields": [
>         [
>           "value",
>           "sum"
>         ],
>         # Note: I removed the lines referring to the field 'gym' which is
>         # no longer present
>         [
>           "dttm",
>           "first"
>         ]
>       ],
>       "weeks": 0,
>       "months": 0,
>       "minutes": 0,
>       "days": 0,
>       "milliseconds": 0,
>       "years": 0
>     }
>   },
>   "inferenceType": "MultiStep",
>   "inferenceArgs": {
>     "predictionSteps": [
>       1
>     ],
>     "predictedField": "value"
>   },
>   "iterationCount": -1,
>   "swarmSize": "medium"
> }
>
>
>
>
> On Sat, Oct 5, 2013 at 3:57 PM, Pedro Tabacof <[email protected]> wrote:
>
>> Just a quick update, I've managed to set the timestamp field with the
>> same format as the hotgym example, but now I'm getting this error:
>>
>> Model Exception: Exception occurred while running model 1146:
>> Exception(u'No such input field: load'
>> ,) (<type 'exceptions.Exception'>)
>>
>> Pedro.
>>
>>
>> On Sat, Oct 5, 2013 at 7:36 PM, Pedro Tabacof <[email protected]> wrote:
>>
>>> Hello Subutai,
>>>
>>> I've two years worth of data, so that means 730 max loads and 35040
>>> half-hourly loads. Besides only using 730 samples, another problem is that
>>> the data is highly seasonal: the competition winners actually discarded
>>> summer data since the prediction target was only January.
>>>
>>> I'm having some problems with swarming:
>>>
>>> 1) I've tried many different naming schemes but run_swarm.py never finds
>>> my data file. The only way I managed was to rename my file to "hotgym.csv"
>>> and use the same path as the "simple" example.
>>>
>>> 2) What is the expected datetime format? Is there a way to change it? I
>>> just cannot get Excel to write dates as YYYY-MM-DD hh:mm:ss, so I'm
>>> using MM/DD/YYYY.
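A minimal sketch of converting the Excel-style timestamps to the format the hotgym example uses. The input pattern here assumes MM/DD/YYYY plus a 24-hour time; the strptime pattern would need adjusting to match the actual export:

```python
from datetime import datetime

def fix_timestamp(s):
    # Parse an Excel-style "MM/DD/YYYY HH:MM" string and reformat it as
    # "YYYY-MM-DD HH:MM:SS", the style used in the hotgym example.
    return datetime.strptime(s, "%m/%d/%Y %H:%M").strftime("%Y-%m-%d %H:%M:%S")

print(fix_timestamp("10/05/2013 14:30"))  # 2013-10-05 14:30:00
```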
>>>
>>> I don't know if it's related to (2), but my swarming fails with:
>>>
>>> ERROR MESSAGE: Exception occurred while running model 1139:
>>> KeyError('load',) (<type 'exceptions.KeyError'>)
>>>
>>> ("load" is the prediction objective)
>>>
>>>
>>> Thanks again!
>>> Pedro.
>>>
>>>
>>>
>>> On Fri, Oct 4, 2013 at 9:43 PM, Subutai Ahmad <[email protected]> wrote:
>>>
>>>> Hi Pedro,
>>>>
>>>> Doing Monte Carlo simulation is a great idea for multi-step prediction. I guess
>>>> one concern is that the number of possibilities grows exponentially the
>>>> longer you look into the future. The simulation time will similarly grow
>>>> exponentially. Still, for a small number of steps it could work well.
>>>>
>>>> For predicting peak load, I think your current approach is pretty good.
>>>> The big drawback as you mentioned is that it reduces the number of data
>>>> points by a factor of 48. How much data do you have? Internally we use a
>>>> rule of thumb where we like to have at least a thousand records to get
>>>> decent results.
>>>>
>>>> The other possible approach is to create a 48-step ahead model and feed
>>>> it half hour data (swarm on this configuration if possible). Then you can
>>>> accumulate the predictions as you go along. So, by midnight Tuesday, you
>>>> should have all the predictions for Wednesday and you can take the peak
>>>> one.  This will allow you to use all the data. You can use the same
>>>> approach for 2 days ahead, etc. I'm not actually sure if this will do
>>>> better than your approach, but thought I'd throw it out there.
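The accumulation bookkeeping described above can be sketched like this; the names and the toy "model" output are illustrative, not from any NuPIC API:

```python
from collections import defaultdict

# At time t a 48-step model emits a prediction for t + 48, so by the end of
# Tuesday we hold all 48 half-hour predictions for Wednesday.
STEPS = 48
predictions = defaultdict(dict)  # target timestep -> {source timestep: value}

def record(t, predicted_value):
    predictions[t + STEPS][t] = predicted_value

for t in range(96):              # two days of half-hour ticks (toy input)
    record(t, float(t % 48))     # stand-in for the model's 48-step output

# All of day 2 was predicted during day 1; take the daily peak from it.
day2 = [predictions[t][t - STEPS] for t in range(48, 96)]
peak = max(day2)
```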
>>>>
>>>> --Subutai
>>>>
>>>>
>>>>
>>>> On Fri, Oct 4, 2013 at 6:04 AM, Pedro Tabacof <[email protected]> wrote:
>>>>
>>>>> Hello Subutai,
>>>>>
>>>>> Since it was quite easy to do, I ended up trying to feed the
>>>>> prediction back to the input. While the results were worse than doing
>>>>> 31-step or 1,...,31-step predictions, it wasn't terrible. Like you said,
>>>>> the simulation degraded with time, but in the end it was still within an
>>>>> acceptable range. Maybe it'd be interesting to research this problem under
>>>>> a Monte Carlo approach, repeating the simulation many times using 
>>>>> different
>>>>> predictions and calculating the final prediction expectation.
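That Monte Carlo idea could look roughly like this; the step function is a toy stand-in for sampling from the CLA's predicted distribution, not a real model:

```python
import random

def one_rollout(start, step_fn, horizon):
    # Sample a next value each step and feed it back as the next input.
    x, path = start, []
    for _ in range(horizon):
        x = step_fn(x)
        path.append(x)
    return path

def monte_carlo(start, step_fn, horizon, runs=1000):
    # Repeat the rollout many times and average the trajectories, giving an
    # estimate of the expected prediction at each future step.
    totals = [0.0] * horizon
    for _ in range(runs):
        for i, v in enumerate(one_rollout(start, step_fn, horizon)):
            totals[i] += v
    return [t / runs for t in totals]

# Toy "model": next value = current value plus zero-mean noise.
est = monte_carlo(0.0, lambda x: x + random.uniform(-1, 1), horizon=5)
```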
>>>>>
>>>>> I raised this question because on this problem I have to predict the
>>>>> max energy load of each day, however I have half-hourly data, so I'm
>>>>> actually discarding a lot of samples to feed to the CLA just the max load
>>>>> of each day. My idea is to use the half-hourly data and then do this
>>>>> prediction feedback so I can predict the half-hourly energy load for the
>>>>> whole month, and then I can take the max load of each day by hand. I still
>>>>> haven't done this because this is gonna be much more challenging, but
>>>>> it is worth a shot even if it is just for "scientific" reasons.
>>>>>
>>>>> Do you have any idea on how to use the half-hourly data in a sensible
>>>>> way?
>>>>>
>>>>> Your suggestion to do swarming on 31 different models is great, I was
>>>>> just stuck thinking of doing only the 1,...,31-step predictions with one
>>>>> single model, but as you said the classifier uses a lot of memory this
>>>>> way and ends up being much slower than it'd be with separate models. I 
>>>>> will
>>>>> try to get swarming running on the VM and then try to do this, it seems
>>>>> like the best shot for a good result.
>>>>>
>>>>> Thanks a lot, it was really helpful!
>>>>>
>>>>> Pedro.
>>>>>
>>>>>
>>>>> On Thu, Oct 3, 2013 at 5:32 PM, Subutai Ahmad <[email protected]> wrote:
>>>>>
>>>>>> Hi Pedro,
>>>>>>
>>>>>> That's encouraging news!  Having your results documented will be
>>>>>> really helpful to everyone.  Here's an attempt to answer your main 
>>>>>> question:
>>>>>>
>>>>>> 1) My feeling is similar to yours - in general I don't think
>>>>>> recursively feeding in classifier predictions is a good idea for 
>>>>>> predicting
>>>>>> many steps ahead. There are multiple predictions made at each time step.
>>>>>> These predictions branch into the future and weird things can happen.
>>>>>> Suppose we fed in the most likely prediction at each time step.  Here's a
>>>>>> simple failure case:
>>>>>>
>>>>>> A  -> B (0.4) -> D (0.1)
>>>>>>  |---> C (0.3) -> E (1.0)
>>>>>>
>>>>>> In this data, after A you get B with 40% chance and C with 30%
>>>>>> chance. After B the most likely element is D but it only has 10% chance. 
>>>>>> E
>>>>>> always follows C with 100% probability.  If you feed the most likely
>>>>>> prediction from A back into the system, you would predict D two steps
>>>>>> ahead. However, E is a better 2-step prediction starting from A.
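Working out the two-step path probabilities in this failure case makes the gap concrete:

```python
# Transition probabilities from the branching example above.
p = {("A", "B"): 0.4, ("A", "C"): 0.3, ("B", "D"): 0.1, ("C", "E"): 1.0}

# Greedy feedback: follow the most likely single step (A->B), then B->D.
greedy_path = p[("A", "B")] * p[("B", "D")]   # 0.4 * 0.1 = 0.04

# The alternative path A->C->E.
best_path = p[("A", "C")] * p[("C", "E")]     # 0.3 * 1.0 = 0.30

print(greedy_path, best_path)  # E is the better 2-step prediction from A
```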
>>>>>>
>>>>>> Other issues can happen. Quite often the probabilities for the
>>>>>> various predictions are quite similar. If you just follow the most likely
>>>>>> path then a small mistake (e.g. a small amount of noise) could throw it
>>>>>> off.   If you could somehow feed in all the probabilities at each time 
>>>>>> step
>>>>>> then maybe you can do a better job but that would be a lot more involved
>>>>>> and I'm not really sure how to do it with CLA.
>>>>>>
>>>>>>
>>>>>> For multi step predictions we have tried the following options:
>>>>>>
>>>>>> a) For x=1 .. 31, train 31 different models, each predicting x steps
>>>>>> ahead. Each model is swarmed specifically for x.  This gives the best
>>>>>> results since the parameters for predicting one month into the future 
>>>>>> could
>>>>>> be different from 1 day into the future. It sounds similar to what you 
>>>>>> did
>>>>>> except for custom swarming. Unfortunately, this is the most time 
>>>>>> consuming
>>>>>> because of the swarming step. Once you get swarming working, you might 
>>>>>> want
>>>>>> to try this with just one 7 step ahead model and see if that is better 
>>>>>> than
>>>>>> your current 7 step model.
>>>>>>
>>>>>> b) Train one model to predict 31 days ahead and accumulate the
>>>>>> results to get all the predictions. So, tomorrow's prediction would have
>>>>>> been made 30 days ago by this model. Surprisingly, in some situations 
>>>>>> with
>>>>>> very regular data this works pretty well.  Quite often it's not as good 
>>>>>> as
>>>>>> a).
>>>>>>
>>>>>> c) A combination of the above. For example, train 3 models to predict
>>>>>> 1 day, 7 days, and 31 days in advance. Accumulate using the closest 
>>>>>> models.
>>>>>> This is a compromise that can work well.
>>>>>>
>>>>>> d) Train a single model to predict 1, 2, 3, …, 31 steps ahead (i.e.
>>>>>> all of them). You can do this by specifying a list of steps for steps
>>>>>> ahead. We've had problems with this though.  The classifier can take up a
>>>>>> lot of memory in this setup. Also, often a single set of parameters 
>>>>>> doesn't
>>>>>> work well for all time ranges.
>>>>>>
>>>>>>
>>>>>> Other questions:
>>>>>>
>>>>>> 2) It should. Scott might know better.
>>>>>>
>>>>>> 3) I don't know - again Scott might know this. If I remember
>>>>>> correctly finishLearning is just an optimization step so you can ignore 
>>>>>> it.
>>>>>> Turning learning off with disableLearning should work for testing.
>>>>>>
>>>>>> 4) Yes, you can run swarming within the VM. The main extra step is
>>>>>> that you need to install MySQL. There is a test script in "python
>>>>>> examples/swarm/test_db.py" to test that the DB is working. If that works
>>>>>> swarming should work. See
>>>>>> https://github.com/numenta/nupic/wiki/Running-Swarms for details.
>>>>>>
>>>>>> This ended up being a really long email!  Hopefully it was helpful.
>>>>>>
>>>>>> --Subutai
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 3, 2013 at 9:13 AM, Pedro Tabacof <[email protected]> wrote:
>>>>>>
>>>>>>> Matt, I haven't uploaded my code anywhere yet. I'd like to try
>>>>>>> a few more things (which depend on the questions I asked) before I do 
>>>>>>> this
>>>>>>> because I know when I upload the code and post the results here I 
>>>>>>> probably
>>>>>>> won't try to improve or change anything. I only work well under pressure
>>>>>>> lol.
>>>>>>>
>>>>>>> Since I'm gonna be away this weekend, I hope that by the end of next
>>>>>>> week I will set up a github page with everything (explanation of the
>>>>>>> problem, dataset, code and results with competition comparisons).
>>>>>>>
>>>>>>> Pedro.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 3, 2013 at 12:56 PM, Matthew Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Pedro, this is exciting! Is your code available online anywhere?
>>>>>>>> Any chance you can put it up on github or bitbucket?
>>>>>>>>
>>>>>>>> ---------
>>>>>>>> Matt Taylor
>>>>>>>> OS Community Flag-Bearer
>>>>>>>> Numenta
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 3, 2013 at 6:59 AM, Pedro Tabacof <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I've been working with an energy competition dataset [1] and I've
>>>>>>>>> been experimenting with some different ways to predict many steps 
>>>>>>>>> ahead (I
>>>>>>>>> have to predict 31 different energy loads for the whole month). This 
>>>>>>>>> led me
>>>>>>>>> to some questions:
>>>>>>>>>
>>>>>>>>> 1) Has anyone tried feeding one-step classifier predictions back
>>>>>>>>> to the input? This can be done easily by hand but I'm not sure if 
>>>>>>>>> this is a
>>>>>>>>> good idea for multi-step prediction.
>>>>>>>>>
>>>>>>>>> 2) Does "disableLearning" also turn off classifier learning? If
>>>>>>>>> not, how do I do it?
>>>>>>>>>
>>>>>>>>> 3) Is "finishLearning" deprecated? I tried using it but I got an
>>>>>>>>> error message.
>>>>>>>>>
>>>>>>>>> 4) Is it possible to run swarming within the Vagrant VM? What about
>>>>>>>>> Cerebro?
>>>>>>>>>
>>>>>>>>> On a side note, so far I have achieved 3.3% MAPE on the test data,
>>>>>>>>> which would put me among the top 10 competitors (out of 26), with 
>>>>>>>>> pretty
>>>>>>>>> much the basic NuPIC configuration, very similar to the hotgym 
>>>>>>>>> example.
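For reference, the MAPE figure quoted here is the mean absolute percentage error; a minimal version:

```python
def mape(actual, predicted):
    # Mean absolute percentage error, expressed as a percentage.
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)

print(mape([100.0, 200.0], [97.0, 206.0]))  # 3.0
```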
>>>>>>>>>
>>>>>>>>> I have experimented with 31-step predictions and 1,2,3,...,31
>>>>>>>>> predictions, but this was too slow and didn't improve the results. 
>>>>>>>>> When I
>>>>>>>>> finish testing all my ideas, I will post my results and experience 
>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> Pedro.
>>>>>>>>>
>>>>>>>>> [1] http://neuron.tuke.sk/competition/index.php
>>>>>>>>> --
>>>>>>>>> Pedro Tabacof,
>>>>>>>>> Unicamp - Eng. de Computação 08.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> nupic mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Pedro Tabacof,
>>>>>>> Unicamp - Eng. de Computação 08.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Pedro Tabacof,
>>>>> Unicamp - Eng. de Computação 08.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Pedro Tabacof,
>>> Unicamp - Eng. de Computação 08.
>>>
>>
>>
>>
>> --
>> Pedro Tabacof,
>> Unicamp - Eng. de Computação 08.
>>
>>
>>
>
>
>


-- 
Pedro Tabacof,
Unicamp - Eng. de Computação 08.
