First you need to preprocess your data; a good tool for that is pandas. Preprocessing is easily 60% of any machine learning task, as you will see. What is the goal you are trying to achieve?
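As a starting point, here is a minimal pandas sketch for loading and lightly cleaning one sensor series. The file name, column names, sampling interval, and the choice to treat zeros as missing are all assumptions for illustration, not something taken from your setup:

    import pandas as pd

    # Hypothetical CSV with a timestamp column and one reading column.
    df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
    df = df.set_index("timestamp").sort_index()

    # Resample to a regular grid (assumed hourly) so gaps show up as NaN.
    hourly = df["value"].resample("1H").mean()

    # Treat exact zeros as missing, assuming zeros mean sensor dropouts.
    hourly = hourly.mask(hourly == 0)

    # Fill short gaps by linear interpolation; leave long gaps for later handling.
    cleaned = hourly.interpolate(method="linear", limit=3)

    print(cleaned.describe())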
If you don't have labelled data (again, I only glanced at your post), unsupervised learning is a good way to go, in which case you'd look at clustering methods; start with k-means clustering (a small sketch is at the end of this message). For k-means you need to know something about your data, namely the approximate number of clusters it might have, or at least some basic information about how it is organized. This approach should only be adopted when you don't have well-defined classes and just want to make some sense of your data.

If you do have labelled data, meaning there is a finite set of classes that any instance in your data can be assigned to, you might start with logistic regression or k-nearest neighbor classification (a k-nearest neighbor sketch is also at the end of this message) and increase the complexity of your classifiers based on their results; look at SVMs or neural networks next. There is another field of machine learning called semi-supervised learning which you can look into after you have explored these preliminary approaches; that last sub-field is just for your information.

Any one of these methods can be adopted depending on your exact goals and the way you formulate your problem.

Hope this helps,
Nikhil

Sent from my iPhone

> On Jan 21, 2014, at 5:47 AM, "Glenn Pierce" <glennpie...@gmail.com> wrote:
>
> Hi,
> I am new to machine learning and was wondering if I could be pointed in the
> direction of further reading with respect to my particular problem.
>
> I have a lot of data from various sensors (electricity, temperature,
> humidity, gas / water usage, etc.)
>
> This data can be corrupt in many ways. For example:
>
> 1. The data can have one-off zero values.
> 2. The data can have large gaps of zeros ranging from hours to weeks.
> 3. There can be large positive and negative spikes due to interference.
> 4. The data can be stuck high for a while due to various reasons.
>
> Currently I am cleaning the data by replacing short gaps with the average
> of the two points on either side.
> Large gaps get replaced with the median of the data for all of its time
> values.
> Spikes and one-off zeros are replaced with an adjacent value.
>
> A spike is determined by a threshold of 3dp from the median value.
>
> This works a lot of the time, but it misses problems like a constant high value
> just below the chosen threshold, and it seems problematic for certain types of data.
>
> I thought a machine learning approach might be more flexible / robust? Maybe I
> am wrong about that assumption, though.
>
> Has anyone got any advice on which area of machine learning I should explore
> first?
> Or maybe my problem is not suited to it?
>
> Thanks for any advice.
>
> Glenn
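For the unsupervised route, here is a minimal k-means sketch with scikit-learn. The feature matrix is random placeholder data and the choice of 3 clusters is an assumption; with your sensors you would first build a feature matrix from the readings (for example, windows of values per meter):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: 200 samples with 4 features each (stand-in for
    # features derived from your sensor readings).
    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 4))

    # Scale features so no single sensor dominates the distance metric.
    X_scaled = StandardScaler().fit_transform(X)

    # k-means needs the number of clusters up front; 3 is just an assumption.
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(X_scaled)

    print(labels[:20])
    print(km.cluster_centers_)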
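And for the supervised route, a k-nearest neighbor classification sketch, again on synthetic placeholder data; your own features and labels (for example "good reading" vs "corrupt reading") would replace the generated arrays:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for labelled sensor features: class 0 = normal,
    # class 1 = corrupt. Replace with your real feature matrix and labels.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 4)),
                   rng.normal(3, 1, size=(100, 4))])
    y = np.array([0] * 100 + [1] * 100)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Start simple: 5 neighbors with the default Euclidean distance.
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))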