Hi Pedro,

Apologies for the delayed response. I was pretty swamped with personal and
Grok stuff. I'm not as familiar with the details of the swarming code, but
here's what I've found:

1) The code uses NuPIC's findDataset function to locate the file. One way to
make a new data file visible is to add its directory to NTA_DATA_PATH. If you
don't want to mess with environment variables, you can instead put the file
in a subdirectory called "data" in the same location as your
search_def.json.  In both cases you can specify the filename as:

"file://small_test.csv"

There seems to be a bug in the swarming logic that prevents absolute paths
from working (findDataset itself handles absolute paths fine, so the bug is
somewhere else).
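To make this concrete, here's a sketch of both setups. The ~/tmp/myswarm path
and the tiny CSV contents are just examples, not anything NuPIC requires:

```python
import os

# Illustrative location for the swarm directory (an assumption, adjust freely)
swarm_dir = os.path.expanduser("~/tmp/myswarm")
data_dir = os.path.join(swarm_dir, "data")

# Option A: prepend the data directory to NTA_DATA_PATH so findDataset
# can resolve "file://small_test.csv".
os.environ["NTA_DATA_PATH"] = (
    data_dir + os.pathsep + os.environ.get("NTA_DATA_PATH", "")
)

# Option B: keep a "data" subdirectory next to search_def.json and put
# the file there; no environment variable needed.
os.makedirs(data_dir, exist_ok=True)
with open(os.path.join(data_dir, "small_test.csv"), "w") as f:
    f.write("dttm,value\ndatetime,float\nT,\n")
```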

2) Field names are specified in a couple of places. I created a new small
test file with different field names. The first few lines of this file are:

dttm,value
datetime,float
T,
2013-07-09 12:05:00.0,117.0666667
2013-07-09 12:20:00.0,118.6666667
2013-07-09 12:35:00.0,120.0666667
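For reference, the three header rows are: field names, field types, and
special flags ("T" marks dttm as the timestamp field; a blank entry means no
flag). Here's a small sketch of writing a file in this format; the code
itself is just for this email, not part of NuPIC:

```python
import csv
import io

# Example records (same values as the small test file above)
rows = [
    ("2013-07-09 12:05:00.0", 117.0666667),
    ("2013-07-09 12:20:00.0", 118.6666667),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["dttm", "value"])      # row 1: field names
writer.writerow(["datetime", "float"])  # row 2: field types
writer.writerow(["T", ""])              # row 3: flags (dttm is the timestamp)
for dttm, value in rows:
    writer.writerow([dttm, value])

print(buf.getvalue())
```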

I put a new search_def.json in a separate directory under ~/tmp and the
above file in a subdirectory called "data".  I then used the
search_def.json below and it ran fine.  I've marked my changes with "#"
comment lines so you can see what's going on (note that JSON doesn't allow
comments, so strip those lines before using the file).  I also updated the
"Running Swarms" wiki page with this information.

I don't think I've answered your latest question about the "No such input
field: load" error. I'm not sure what's going on there, but you may want to
double-check that the field names are consistent everywhere. If you're still
having problems, can you email your search_def.json and the first few lines
of the file?


--Subutai

{
  "includedFields": [
    {
      "fieldName": "dttm",
      "fieldType": "datetime"
    },
    {
      "fieldName": "value",
      "fieldType": "float"
    }
  ],
  "streamDef": {
    "info": "test",
    "version": 1,
    "streams": [
      {
        "info": "hotGym.csv",
        "source": "file://small_test.csv",
        "columns": [
          "*"
        ],
        "last_record": 100
      }
    ],
    "aggregation": {
      "hours": 1,
      "microseconds": 0,
      "seconds": 0,
      "fields": [
        [
          "value",
          "sum"
        ],
        # Note: I removed the lines referring to the field 'gym' which is
        # no longer present
        [
          "dttm",
          "first"
        ]
      ],
      "weeks": 0,
      "months": 0,
      "minutes": 0,
      "days": 0,
      "milliseconds": 0,
      "years": 0
    }
  },
  "inferenceType": "MultiStep",
  "inferenceArgs": {
    "predictionSteps": [
      1
    ],
    "predictedField": "value"
  },
  "iterationCount": -1,
  "swarmSize": "medium"
}
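Since the "No such input field" and KeyError messages smell like a field-name
mismatch, here's a small sanity check you could run on a search description
before swarming. This is just a sketch written for this email
(check_search_def is not part of NuPIC); it strips "#" annotation lines and
verifies the predicted field appears in includedFields:

```python
import json

def check_search_def(text):
    """Strip '#' comment lines (JSON has no comments) and verify that the
    predicted field is listed in includedFields."""
    cleaned = "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith("#")
    )
    search_def = json.loads(cleaned)
    included = {f["fieldName"] for f in search_def["includedFields"]}
    predicted = search_def["inferenceArgs"]["predictedField"]
    if predicted not in included:
        raise ValueError("predictedField %r not in includedFields" % predicted)
    return search_def

sample = """
{
  "includedFields": [
    {"fieldName": "dttm", "fieldType": "datetime"},
    {"fieldName": "value", "fieldType": "float"}
  ],
  # annotation lines like this one are stripped before parsing
  "inferenceArgs": {"predictedField": "value", "predictionSteps": [1]}
}
"""
print(check_search_def(sample)["inferenceArgs"]["predictedField"])
```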




On Sat, Oct 5, 2013 at 3:57 PM, Pedro Tabacof <[email protected]> wrote:

> Just a quick update, I've managed to set the timestamp field with the same
> format as the hotgym example, but now I'm getting this error:
>
> Model Exception: Exception occurred while running model 1146:
> Exception(u'No such input field: load'
> ,) (<type 'exceptions.Exception'>)
>
> Pedro.
>
>
> On Sat, Oct 5, 2013 at 7:36 PM, Pedro Tabacof <[email protected]> wrote:
>
>> Hello Subutai,
>>
>> I've two years worth of data, so that means 730 max loads and 35040
>> half-hourly loads. Besides only using 730 samples, another problem is that
>> the data is highly seasonal: the competition winners actually discarded
>> summer data since the prediction target was only January.
>>
>> I'm having some problems with swarming:
>>
>> 1) I've tried many different naming schemes but run_swarm.py never finds
>> my data file. The only way I managed was to rename my file to "hotgym.csv"
>> and use the same path as the "simple" example.
>>
>> 2) What is the expected datetime format? Is there a way to change it? I
>> just can't get Excel to write dates as YYYY-MM-DD hh:mm:ss, so I'm
>> using MM/DD/YYYY.
>>
>> I don't know if it's related to (2), but my swarming fails with:
>>
>> ERROR MESSAGE: Exception occurred while running model 1139:
>> KeyError('load',) (<type 'exceptions.KeyError'>)
>>
>> ("load" is the prediction objective)
>>
>>
>> Thanks again!
>> Pedro.
>>
>>
>>
>> On Fri, Oct 4, 2013 at 9:43 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Hi Pedro,
>>>
>>> Doing Monte Carlo simulation is a great idea for multi-step prediction. I
>>> guess
>>> one concern is that the number of possibilities grows exponentially the
>>> longer you look into the future. The simulation time will similarly grow
>>> exponentially. Still, for a small number of steps it could work well.
>>>
>>> For predicting peak load, I think your current approach is pretty good.
>>> The big drawback as you mentioned is that it reduces the number of data
>>> points by a factor of 48. How much data do you have? Internally we use a
>>> rule of thumb where we like to have at least a thousand records to get
>>> decent results.
>>>
>>> The other possible approach is to create a 48-step ahead model and feed
>>> it half hour data (swarm on this configuration if possible). Then you can
>>> accumulate the predictions as you go along. So, by midnight Tuesday, you
>>> should have all the predictions for Wednesday and you can take the peak
>>> one.  This will allow you to use all the data. You can use the same
>>> approach for 2 days ahead, etc. I'm not actually sure if this will do
>>> better than your approach, but thought I'd throw it out there.
>>>
>>> --Subutai
>>>
>>>
>>>
>>> On Fri, Oct 4, 2013 at 6:04 AM, Pedro Tabacof <[email protected]> wrote:
>>>
>>>> Hello Subutai,
>>>>
>>>> Since it was quite easy to do, I ended up trying to feed the
>>>> prediction back to the input. While the results were worse than doing
>>>> 31-step or 1,...,31-step predictions, it wasn't terrible. Like you said,
>>>> the simulation degraded with time, but in the end it was still within an
>>>> acceptable range. Maybe it'd be interesting to research this problem under
>>>> a Monte Carlo approach, repeating the simulation many times using different
>>>> predictions and calculating the final prediction expectation.
>>>>
>>>> I raised this question because on this problem I have to predict the
>>>> max energy load of each day, however I have half-hourly data, so I'm
>>>> actually discarding a lot of samples to feed to the CLA just the max load
>>>> of each day. My idea is to use the half-hourly data and then do this
>>>> prediction feedback so I can predict the half-hourly energy load for the
>>>> whole month, and then I can take the max load of each day by hand. I still
>>>> haven't done this because this is gonna be much more challenging, but it
>>>> is worth a shot even if it is just for "scientific" reasons.
>>>>
>>>> Do you have any idea on how to use the half-hourly data in a sensible
>>>> way?
>>>>
>>>> Your suggestion to do swarming on 31 different models is great, I was
>>>> just stuck thinking of doing only the 1,...,31-step predictions with one
>>>> single model, but as you said the classifier uses a lot of memory this
>>>> way and ends up being much slower than it'd be with separate models. I will
>>>> try to get swarming running on the VM and then try to do this, it seems
>>>> like the best shot for a good result.
>>>>
>>>> Thanks a lot, it was really helpful!
>>>>
>>>> Pedro.
>>>>
>>>>
>>>> On Thu, Oct 3, 2013 at 5:32 PM, Subutai Ahmad <[email protected]> wrote:
>>>>
>>>>> Hi Pedro,
>>>>>
>>>>> That's encouraging news!  Having your results documented will be
>>>>> really helpful to everyone.  Here's an attempt to answer your main 
>>>>> question:
>>>>>
>>>>> 1) My feeling is similar to yours - in general I don't think
>>>>> recursively feeding in classifier predictions is a good idea for 
>>>>> predicting
>>>>> many steps ahead. There are multiple predictions made at each time step.
>>>>> These predictions branch into the future and weird things can happen.
>>>>> Suppose we fed in the most likely prediction at each time step.  Here's a
>>>>> simple failure case:
>>>>>
>>>>> A  -> B (0.4) -> D (0.1)
>>>>>  |---> C (0.3) -> E (1.0)
>>>>>
>>>>> In this data, after A you get B with 40% chance and C with 30% chance.
>>>>> After B the most likely element is D but it only has 10% chance. E always
>>>>> follows C with 100% probability.  If you feed the most likely prediction
>>>>> from A back into the system, you would predict D two steps ahead. However,
>>>>> E is a better 2-step prediction starting from A.
>>>>>
>>>>> Other issues can happen. Quite often the probabilities for the various
>>>>> predictions are quite similar. If you just follow the most likely path 
>>>>> then
>>>>> a small mistake (e.g. a small amount of noise) could throw it off.   If 
>>>>> you
>>>>> could somehow feed in all the probabilities at each time step then maybe
>>>>> you can do a better job but that would be a lot more involved and I'm not
>>>>> really sure how to do it with CLA.
>>>>>
>>>>>
>>>>> For multi step predictions we have tried the following options:
>>>>>
>>>>> a) For x=1 .. 31, train 31 different models, each predicting x steps
>>>>> ahead. Each model is swarmed specifically for x.  This gives the best
>>>>> results since the parameters for predicting one month into the future 
>>>>> could
>>>>> be different from 1 day into the future. It sounds similar to what you did
>>>>> except for custom swarming. Unfortunately, this is the most time consuming
>>>>> because of the swarming step. Once you get swarming working, you might 
>>>>> want
>>>>> to try this with just one 7 step ahead model and see if that is better 
>>>>> than
>>>>> your current 7 step model.
>>>>>
>>>>> b) Train one model to predict 31 days ahead and accumulate the results
>>>>> to get all the predictions. So, tomorrow's prediction would have been made
>>>>> 30 days ago by this model. Surprisingly, in some situations with very
>>>>> regular data this works pretty well.  Quite often it's not as good as a).
>>>>>
>>>>> c) A combination of the above. For example, train 3 models to predict
>>>>> 1 day, 7 days, and 31 days in advance. Accumulate using the closest 
>>>>> models.
>>>>> This is a compromise that can work well.
>>>>>
>>>>> d) Train a single model to predict 1, 2, 3, …, 31 steps ahead (i.e.
>>>>> all of them). You can do this by specifying a list of steps for steps
>>>>> ahead. We've had problems with this though.  The classifier can take up a
>>>>> lot of memory in this setup. Also, often a single set of parameters 
>>>>> doesn't
>>>>> work well for all time ranges.
>>>>>
>>>>>
>>>>> Other questions:
>>>>>
>>>>> 2) It should. Scott might know better.
>>>>>
>>>>> 3) I don't know - again Scott might know this. If I remember correctly
>>>>> finishLearning is just an optimization step so you can ignore it. Turning
>>>>> learning off with disableLearning should work for testing.
>>>>>
>>>>> 4) Yes, you can run swarming within the VM. The main extra step is
>>>>> that you need to install MySQL. There is a test script in "python
>>>>> examples/swarm/test_db.py" to test that the DB is working. If that works
>>>>> swarming should work. See
>>>>> https://github.com/numenta/nupic/wiki/Running-Swarms for details.
>>>>>
>>>>> This ended up being a really long email!  Hopefully it was helpful.
>>>>>
>>>>> --Subutai
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 3, 2013 at 9:13 AM, Pedro Tabacof <[email protected]> wrote:
>>>>>
>>>>>> Matt, I haven't uploaded my code anywhere yet. I'd like to try a
>>>>>> few more things (which depend on the questions I asked) before I do this
>>>>>> because I know when I upload the code and post the results here I 
>>>>>> probably
>>>>>> won't try to improve or change anything. I only work well under pressure
>>>>>> lol.
>>>>>>
>>>>>> Since I'm gonna be away this weekend, I hope that by the end of next
>>>>>> week I will set up a github page with everything (explanation of the
>>>>>> problem, dataset, code and results with competition comparisons).
>>>>>>
>>>>>> Pedro.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 3, 2013 at 12:56 PM, Matthew Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Pedro, this is exciting! Is your code available online anywhere? Any
>>>>>>> chance you can put it up on github or bitbucket?
>>>>>>>
>>>>>>> ---------
>>>>>>> Matt Taylor
>>>>>>> OS Community Flag-Bearer
>>>>>>> Numenta
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 3, 2013 at 6:59 AM, Pedro Tabacof <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I've been working with an energy competition dataset [1] and I've
>>>>>>>> been experimenting with some different ways to predict many steps 
>>>>>>>> ahead (I
>>>>>>>> have to predict 31 different energy loads for the whole month). This 
>>>>>>>> led me
>>>>>>>> to some questions:
>>>>>>>>
>>>>>>>> 1) Has anyone tried feeding one-step classifier predictions back to
>>>>>>>> the input? This can be done easily by hand but I'm not sure if this is 
>>>>>>>> a
>>>>>>>> good idea for many steps prediction.
>>>>>>>>
>>>>>>>> 2) Does "disableLearning" also turn off classifier learning? If
>>>>>>>> not, how do I do it?
>>>>>>>>
>>>>>>>> 3) Is "finishLearning" deprecated? I tried using it but I got an
>>>>>>>> error message.
>>>>>>>>
>>>>>>>> 4) Is it possible to run swarming within the Vagrant VM? What about
>>>>>>>> Cerebro?
>>>>>>>>
>>>>>>>> On a side note, so far I have achieved 3.3% MAPE on the test data,
>>>>>>>> which would put me among the top 10 competitors (out of 26), with 
>>>>>>>> pretty
>>>>>>>> much the basic NuPIC configuration, very similar to the hotgym example.
>>>>>>>>
>>>>>>>> I have experimented with 31-step predictions and 1,2,3,...,31
>>>>>>>> predictions, but this was too slow and didn't improve the results. 
>>>>>>>> When I
>>>>>>>> finish testing all my ideas, I will post my results and experience 
>>>>>>>> here.
>>>>>>>>
>>>>>>>> Pedro.
>>>>>>>>
>>>>>>>> [1] http://neuron.tuke.sk/competition/index.php
>>>>>>>> --
>>>>>>>> Pedro Tabacof,
>>>>>>>> Unicamp - Eng. de Computação 08.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> nupic mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Pedro Tabacof,
>>>>>> Unicamp - Eng. de Computação 08.
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Pedro Tabacof,
>>>> Unicamp - Eng. de Computação 08.
>>>>
>>>
>>
>>
>> --
>> Pedro Tabacof,
>> Unicamp - Eng. de Computação 08.
>>
>
>
>
> --
> Pedro Tabacof,
> Unicamp - Eng. de Computação 08.
>