First, you need to preprocess your data; pandas is a good tool for that. 
Preprocessing is roughly 60% of the work in any machine learning task, as you 
will see. What is the goal you are trying to achieve? 
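
As a rough sketch of the kind of preprocessing I mean: the snippet below 
assumes your readings fit in a pandas Series and that a zero always marks a 
dropped sample rather than a real measurement. The values and variable names 
are made up for illustration, so adapt them to your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; the zeros stand in for dropped samples.
readings = pd.Series([20.1, 20.3, 0.0, 20.7, 0.0, 0.0, 0.0, 21.5])

# Assumption: a true reading of exactly 0 never occurs for this sensor,
# so every zero can be treated as a missing value.
with_gaps = readings.replace(0.0, np.nan)

# Fill each gap by linear interpolation between its two neighbours.
# For a single missing sample this is the average of the adjacent points.
cleaned = with_gaps.interpolate(method="linear")

print(cleaned)
```

For long gaps or spikes you would want a different rule; this only shows the 
replace-then-interpolate pattern.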

If you don't have labelled data (again, I only glanced at your post), 
unsupervised learning is a good way to go. In that case you'd look at 
clustering methods, starting with k-means. For k-means you need to know 
something about your data: either the approximate number of clusters it might 
contain or some basic information about how it is organized. Use this approach 
only when you don't have well-defined classes and just want to make some sense 
of your data. 
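
To give an idea of what k-means looks like in scikit-learn, here is a minimal 
sketch on synthetic data. The two-feature array and the choice of two 
clusters are assumptions made for the example, not something taken from your 
sensor data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Hypothetical data: 50 points around (0, 0) and 50 points around (5, 5).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# The number of clusters has to be chosen up front; 2 is assumed here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)
```

If you have no idea how many clusters to expect, you can try several values 
of n_clusters and compare the results.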

If you do have labelled data, meaning there is a finite number of classes that 
any instance in your data can be assigned to, you might start with logistic 
regression or k-nearest neighbor classification and increase the complexity of 
your classifiers based on how those perform. After that, look at SVMs or 
neural networks.
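
A minimal sketch of that starting point, assuming scikit-learn and a made-up 
two-class dataset (the features, labels, and split sizes are all 
illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Hypothetical labelled data: two classes with different means.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

# Hold out a quarter of the data to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit both simple baselines and record their held-out accuracy.
scores = {}
for name, model in [
    ("logistic regression", LogisticRegression()),
    ("k-nearest neighbors", KNeighborsClassifier(n_neighbors=5)),
]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

If both baselines score poorly on held-out data, that is the signal to try a 
more complex classifier.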

There is another field of machine learning called semi-supervised learning, 
which you can look into after you have explored these preliminary approaches. 
I mention this last sub-field just for your information. Any one of these 
methods can be adopted, depending on your exact goals and the way you 
formulate your problem.

Hope this helps,
Nikhil

> On Jan 21, 2014, at 5:47 AM, "Glenn Pierce" <glennpie...@gmail.com> wrote:
> 
> Hi, 
> I am new to machine learning and was wondering if I could be pointed in the 
> direction for further reading with respect to my particular problem.
> 
> I have a lot of data from various sensors (Electricity, Temperature, 
> Humidity, Gas / Water Usage etc)
> 
> This data can be corrupt in many ways. For example
> 
> 1. The data can have one-off zero values.
> 2. The data can have large gaps of zeros ranging from hours to weeks.
> 3. There can be both positive and negative large spikes due to interference.
> 4. The data can be stuck high for a while due to various reasons.
> 
> Currently I am cleaning the data by replacing short gaps with the average 
> values of the two points it is between.
> Large gaps get replaced with the median of the data for all of its time 
> values.
> Spikes and one-off zeros are replaced with an adjacent value.
> 
> The spike is determined from a threshold of 3dp from the median value.
> 
> This works a lot of the time but misses problems like a constant high value 
> just below the chosen threshold, and it seems problematic for certain types 
> of data.
> 
> I thought a machine learning approach might be more flexible / robust? Maybe 
> I am wrong with that assumption though.
> 
> Has anyone got any advice on which area of machine learning I should explore 
> first?
> Or maybe my problem is not suited to it?
> 
> Thanks for any advice. 
> 
> Glenn
> ------------------------------------------------------------------------------
> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
> Learn Why More Businesses Are Choosing CenturyLink Cloud For
> Critical Workloads, Development Environments & Everything In Between.
> Get a Quote or Start a Free Trial Today. 
> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


