On 21 January 2014 11:24, nmura...@masonlive.gmu.edu <
nmura...@masonlive.gmu.edu> wrote:

> Firstly, you need to preprocess your data; a good tool for that is pandas.
> That is 60% of any machine learning task, as you will see. What is the goal
> you are trying to achieve?
>
Really just replacement of the obviously wrong data with something that
could be correct. I don't need to be exact. The original data will still be
available.
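
For illustration, a minimal pandas sketch of that kind of replacement might
look like the following. The function name clean_series, the max_gap and
spike_factor parameters, and the MAD-based spike rule are all assumptions made
for this example; they are not taken from the cleaning code described in the
original post.

```python
import numpy as np
import pandas as pd


def clean_series(s, max_gap=3, spike_factor=5.0):
    """Illustrative cleaner: zeros and spikes become gaps, gaps get filled."""
    s = s.astype(float)
    # Treat exact zeros as missing readings.
    cleaned = s.mask(s == 0.0)

    # Flag spikes as points far from the median, in units of the median
    # absolute deviation (MAD). This rule is an assumption for the sketch.
    median = cleaned.median()
    mad = (cleaned - median).abs().median()
    if mad > 0:
        cleaned = cleaned.mask((cleaned - median).abs() > spike_factor * mad)

    # Measure the length of each run of consecutive missing points.
    flags = cleaned.isna().astype(int)
    run_id = (flags != flags.shift()).cumsum()
    run_len = flags.groupby(run_id).transform("sum")
    short = (flags == 1) & (run_len <= max_gap)

    # Short gaps: linear interpolation between the neighbouring good points.
    interpolated = cleaned.interpolate(method="linear", limit_area="inside")
    cleaned = cleaned.where(~short, interpolated)

    # Long gaps (and gaps at the edges): fall back to the overall median.
    return cleaned.fillna(median)


# Made-up readings: a one-off zero, a spike, and a longer run of zeros.
raw = pd.Series([10, 11, 0, 13, 500, 12, 0, 0, 0, 0, 0, 11, 10])
result = clean_series(raw, max_gap=1)
print(result.tolist())
```

Nothing here overwrites the input series, so the original data stays
available for comparison.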


> If you don't have labelled data (again, I only glanced at your post),
> unsupervised learning is a good way to go, in which case you'd look at
> clustering methods; start with k-means clustering. For k-means
> clustering you need to know something about your data: either the
> approximate number of clusters that your data might have, or some
> basic information about how your data is organized. This approach should
> only be adopted when you don't have well-defined classes and
> just want to make some sense of your data.
>
> If you do have labelled data, meaning there is a finite number of
> classes that any instance in your data can be classified into, you might
> start with logistic regression or k-nearest neighbor classification and
> increase the complexity of your classifiers based on the results of these
> classifiers. Look at SVMs or neural networks.
>
> There is another field of machine learning called semi-supervised
> learning, which you can look into after you have explored these preliminary
> approaches. This last sub-field is just for your information. Any one of
> these methods can be adopted depending on your exact goals and the way you
> formulate your problem.
>
>

Thanks for that. I will start with the k-means clustering.
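
As a rough starting point, k-means in scikit-learn could be tried along these
lines. This is only a sketch on synthetic numbers: the two operating levels,
the choice of n_clusters=2 and the distance cut-off of 5.0 are assumptions
made up for the example, and flagging points by distance from their cluster
centre is just one possible way to use the clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for sensor readings: two operating levels
# plus two obviously bad points (a zero and a large spike).
readings = np.concatenate([
    rng.normal(20.0, 0.5, 100),
    rng.normal(35.0, 0.5, 100),
    [0.0, 90.0],
])
X = readings.reshape(-1, 1)  # scikit-learn expects a 2-D array

# The number of clusters has to be chosen up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each reading from the centre of its assigned cluster.
dist = np.abs(X[:, 0] - km.cluster_centers_[km.labels_, 0])

# Hypothetical rule: anything further than 5 units from its centre is suspect.
suspect = dist > 5.0
print(readings[suspect])
```

With real sensor data the number of clusters and the cut-off would both need
to be chosen by looking at the data first.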


> Hope this helps,
> Nikhil
> Sent from my iPhone
>
> > On Jan 21, 2014, at 5:47 AM, "Glenn Pierce" <glennpie...@gmail.com> wrote:
> >
> > Hi,
> > I am new to machine learning and was wondering if I could be pointed in
> > the right direction for further reading with respect to my particular problem.
> >
> > I have a lot of data from various sensors (Electricity, Temperature,
> > Humidity, Gas / Water Usage etc.)
> >
> > This data can be corrupt in many ways. For example:
> >
> > 1, The data can have one-off zero values.
> > 2, The data can have large gaps of zeros ranging from hours to weeks.
> > 3, There can be both positive and negative large spikes due to
> > interference.
> > 4, The data can be stuck high for a while due to various reasons.
> >
> > Currently I am cleaning the data by replacing short gaps with the
> > average of the two points on either side of the gap.
> > Large gaps get replaced with the median of the data for all of its time
> > values.
> > Spikes and one-off zeros are replaced with an adjacent value.
> >
> > The spike is determined from a threshold of 3dp from the median value.
> >
> > This works a lot of the time but misses problems like a constant high
> > value just below the chosen threshold, and seems problematic
> > for certain types of data.
> >
> > I thought a machine learning approach may be more flexible / robust?
> > Maybe I am wrong with that assumption though.
> >
> > Has anyone got any advice on which area of machine learning I should
> > explore first?
> > Or maybe my problem is not suited to it?
> >
> > Thanks for any advice.
> >
> > Glenn
> >
> ------------------------------------------------------------------------------
> > CenturyLink Cloud: The Leader in Enterprise Cloud Services.
> > Learn Why More Businesses Are Choosing CenturyLink Cloud For
> > Critical Workloads, Development Environments & Everything In Between.
> > Get a Quote or Start a Free Trial Today.
> >
> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>