Re: Statistics Tool For Classification/Clustering
Good places to start:

Optimal feature extractors: that's better than PCA because you whiten your inter class scatter and so put all inter class comparisons on the same level. The good thing is that this will also reduce your feature vector dimensionality to c-1 (where c is the number of classes). PCA will not do this.

Check the stats of each class: is it Gaussian or another known pdf? Apply a parametric classifier if so.

However, you are lucky if you get good classification after this, so you will probably need non-linear, non-parametric classifiers. Try K nearest neighbour, but that might take the age of the Universe, so use a condensing algorithm first to get a smaller representative set.

Matlab is what I use for coding; there are a lot of free toolboxes around. Mostly I write my own though.

Best wishes

Andrew

"Rishabh Gupta" <[EMAIL PROTECTED]> wrote in message news:a4eje9$ip8$[EMAIL PROTECTED]...
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification. I am at the stage where I perform some analysis on music
> files (currently only in MIDI format) and extract about 500 variables that
> are related to music properties like pitch, rhythm, polyphony and volume. I
> am performing basic analysis like mean and standard deviation but then I
> also perform more elaborate analysis like measuring complexity of melody and
> rhythm.
>
> The aim is that the variables obtained can be used to perform a number of
> different operations.
> - The variables can be used to classify / categorise each piece of
> music, on its own, in terms of some meta classifier (e.g. rock, pop,
> classical).
> - The variables can be used to perform comparison between two files. A
> variable from one music file can be compared to the equivalent variable in
> the other music file. By comparing all the variables in one file with the
> equivalent variable in the other file, an overall similarity measurement can
> be obtained.
>
> The next stage is to test the ability of the variables obtained to
> perform the classification / comparison. I need to identify variables that
> are redundant (redundant in the sense of 'they do not provide any
> information' and 'they provide the same information as the other variable')
> so that they can be removed and I need to identify variables that are
> distinguishing (provide the most information).
>
> My Basic Questions Are:
> - What are the best statistical techniques / methods that should be
> applied here? E.g. I have looked at Principal Component Analysis; this would
> be a good method to remove the redundant variables and hence reduce the
> amount of data that needs to be processed. Can anyone suggest any other
> sensible statistical analysis methods?
> - What are the ideal tools / software to perform the clustering /
> classification? I have access to SPSS software but I have never used it
> before and am not really sure how to apply it or whether it is any good when
> dealing with 100s of variables.
>
> So far I have been analysing each variable on its own 'by eye' by plotting
> the mean and sd for all music files. However this approach is not feasible
> in the long term since I am dealing with such a large number of variables.
> In addition, by looking at each variable on its own, I do not find clusters
> / patterns that are only visible through multivariate analysis. If anyone
> can recommend a better approach I would greatly appreciate it.
>
> Any help or suggestion that can be offered will be greatly appreciated.
>
> Many Thanks!
>
> Rishabh Gupta

= Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
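The "condensing algorithm" suggested above for speeding up K nearest neighbour can be sketched roughly as follows. This is only a minimal illustration in the style of Hart's condensed nearest neighbour rule, not necessarily the algorithm the poster had in mind; the function names and the toy 2-D "genre" data are made up for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(point, prototypes):
    """Label of the stored prototype closest to `point` (1-NN rule)."""
    return min(prototypes, key=lambda p: euclidean(point, p[0]))[1]


def condense(samples):
    """Keep only the samples that the current store misclassifies.

    `samples` is a list of (feature_vector, label) pairs. Passes over the
    data repeat until a full pass adds nothing, so the returned subset
    classifies every training sample correctly under the 1-NN rule
    (assuming no two samples share a vector but carry different labels).
    """
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for vec, label in samples:
            if (vec, label) in store:
                continue  # already a prototype; guards against endless re-adding
            if nearest_label(vec, store) != label:
                store.append((vec, label))
                changed = True
    return store


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
training = [
    ((0.0, 0.1), "classical"), ((0.2, 0.0), "classical"), ((0.1, 0.3), "classical"),
    ((5.0, 5.1), "rock"), ((5.2, 4.9), "rock"), ((4.8, 5.3), "rock"),
]
prototypes = condense(training)
```

New files would then be classified with `nearest_label(vector, prototypes)` against the smaller set instead of the full training set.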
Re: Statistics Tool For Classification/Clustering
Correction of a typo: it should read 'whiten intra class scatter'.

"Mark Harrison" <[EMAIL PROTECTED]> wrote in message news:FIif8.16518$[EMAIL PROTECTED]...
> Good places to start:
>
> Optimal feature extractors, that's better than PCA because you whiten your
> inter class scatter and so put all inter class comparisons on the same
> level. The good thing is this will also reduce your feature vector
> dimensionality to c-1 (where c is # classes). PCA will not do this.
>
> [remainder of quoted message snipped]
Re: Statistics Tool For Classification/Clustering
Rishabh Gupta <[EMAIL PROTECTED]> wrote in message news:a4eje9$ip8$[EMAIL PROTECTED]...
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification.

Pleased to see you have had many suggestions. But I would have thought you are sitting right on top of all the books you may need on the shelves in the university library.
Re: Statistics Tool For Classification/Clustering
Hi all,

I received numerous replies to my query. I can't thank everyone individually, so I want to thank everyone who has replied. I am now looking through the information and links that you have provided.

Many Thanks For All Your Help!!

Rishabh

"Rishabh Gupta" <[EMAIL PROTECTED]> wrote in message news:a4eje9$ip8$[EMAIL PROTECTED]...
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification.
>
> [remainder of quoted message snipped]
Re: Statistics Tool For Classification/Clustering
"Richard Wright" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]...
> Genres are presumably groups. So linear combinations of variables that
> best separate the genres would be more effectively found by linear
> canonical variates analysis (aka discriminant analysis).
>
> Richard Wright
>
> On Thu, 14 Feb 2002 03:18:48 GMT, "Jim Snow" <[EMAIL PROTECTED]>
> wrote:
>
> snipped
> >My inclination would be to start with an Andrews plot, possibly
> >using principal component scores for about 20 music files from several
> >genres. This will enable you to find linear combinations of variables which
> >best separate the genres. The technique and examples are set out in:
> snipped

Andrews plots and similar techniques do not replace discriminant analysis, which, as Richard Wright said, "finds linear combinations of variables that best separate the genres". In the book by Gnanadesikan which first popularised the technique, he examines the variables in the discriminant space, i.e. a space defined by discriminant functions rather than principal components or original variables.

The techniques are doing different things. Andrews plots enable examination of multidimensional data in a two-dimensional plot. Amongst other things, several dimensions of high difference, say between jazz and pop or between jazz and flamenco, may be found, and these are not necessarily orthogonal. Andrews plots are a data reduction technique which is, in many dimensions, analogous to examining a multidimensional cluster of points from many viewpoints, so that no possible viewpoint is far from one of those used. Thus virtually all possible discriminant functions are tried and the interesting ones noted. In a spirit of exploratory data analysis, this seems useful.

Rishabh Gupta wrote:
"- The variables can be used to perform comparison between two files. A variable from one music file can be compared to the equivalent variable in the other music file. By comparing all the variables in one file with the equivalent variable in the other file, an overall similarity measurement can be obtained."

Andrews plots reveal the directions in which the two files differ. Incidentally, the total area between the two traces on the plot is the Euclidean distance, I think, if the original Andrews weightings are used. Tukey suggested weightings which examine the multidimensional space more closely but do not have such a simple interpretation of the difference between traces.

I have not used any of this for some time and I do not have the relevant books, but the material I referred to on the web should be helpful.

Straightforward discriminant analysis will certainly find the best linear discriminator in the least squares sense. However, stepwise elimination of variables in this process may result in discarding a variable with intuitive appeal in favour of one or several highly correlated with it, and the least squares metric may not be the best. For this and other reasons an exploratory approach, as Rishabh Gupta has begun, seems appropriate.

I still hope this helps

Jim Snow
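For anyone who wants to try the Andrews plot idea, here is a rough sketch of the curve with the original Andrews weightings, f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..., for t between -pi and pi. On the distance remark above: as far as I know, the property is usually stated for the integrated squared difference between two traces, which equals pi times the squared Euclidean distance between the two vectors, rather than for the plain area between them. The two five-variable "files" below are made-up numbers.

```python
import math


def andrews_curve(x, t):
    """Value at angle t of the Andrews curve for feature vector x."""
    total = x[0] / math.sqrt(2)
    for i, value in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                 # 1, 1, 2, 2, 3, 3, ...
        basis = math.sin if i % 2 == 1 else math.cos
        total += value * basis(harmonic * t)
    return total


def integrated_squared_gap(x, y, steps=2000):
    """Midpoint-rule integral of (f_x - f_y)^2 over [-pi, pi]."""
    h = 2 * math.pi / steps
    total = 0.0
    for k in range(steps):
        t = -math.pi + (k + 0.5) * h
        total += (andrews_curve(x, t) - andrews_curve(y, t)) ** 2
    return total * h


# Hypothetical feature vectors for two music files (already on a common scale).
file_a = [0.8, 0.1, 0.5, 0.3, 0.9]
file_b = [0.2, 0.4, 0.5, 0.7, 0.1]

squared_distance = sum((a - b) ** 2 for a, b in zip(file_a, file_b))
gap = integrated_squared_gap(file_a, file_b)
# gap should agree with math.pi * squared_distance up to rounding error.
```

Because the low-numbered variables get the low-frequency terms, the order of the variables matters for how the plot looks, which is one reason for feeding in principal component scores rather than raw variables.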
Re: Statistics Tool For Classification/Clustering
You might consider a form of PLS: your measurements may be highly correlated, and only a very few can do you any good. You have a great many output vars, and few enough inputs.

Jay

Rishabh Gupta wrote:
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification.
>
> [remainder of quoted message snipped]

--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
North Green Bay Road
Racine, WI 53404-1216
USA

Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com

The A2Q Method (tm) -- What do you want to improve today?
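Before going as far as PLS, the point about highly correlated measurements can be checked with a much cruder screen. The sketch below is not PLS; it is a simple illustration that drops constant variables ("do not provide any information") and variables that are almost perfectly correlated with one already kept ("provide the same information as the other variable"). The variable names, the data, and the 0.95 threshold are all assumptions made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def screen_variables(columns, threshold=0.95):
    """Return (kept, dropped) for a dict of variable name -> list of values.

    A constant column is dropped as uninformative. A column is dropped as
    redundant if |r| with an already-kept column exceeds `threshold`.
    """
    kept, dropped = [], {}
    for name, values in columns.items():
        if max(values) == min(values):
            dropped[name] = "constant"
            continue
        twin = next(
            (k for k in kept if abs(pearson(values, columns[k])) > threshold),
            None,
        )
        if twin is None:
            kept.append(name)
        else:
            dropped[name] = "correlated with " + twin
    return kept, dropped


# Hypothetical variables measured on five music files.
features = {
    "mean_pitch":    [60.0, 62.0, 55.0, 70.0, 65.0],
    "median_pitch":  [61.0, 63.0, 56.0, 71.0, 66.0],   # mean_pitch shifted by 1
    "note_density":  [3.0, 1.0, 4.0, 1.5, 5.0],
    "channel_count": [16.0, 16.0, 16.0, 16.0, 16.0],   # never varies
}
kept, dropped = screen_variables(features)
```

Which member of a correlated pair survives depends only on the order of the columns, so with real data it is worth putting the variables with the most intuitive appeal first.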
Re: Statistics Tool For Classification/Clustering
Genres are presumably groups. So linear combinations of variables that best separate the genres would be more effectively found by linear canonical variates analysis (aka discriminant analysis).

Richard Wright

On Thu, 14 Feb 2002 03:18:48 GMT, "Jim Snow" <[EMAIL PROTECTED]> wrote:

snipped
>My inclination would be to start with an Andrews plot, possibly
>using principal component scores for about 20 music files from several
>genres. This will enable you to find linear combinations of variable which
>best separate the genres. The technique and examples is set out in:
snipped
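To make the discriminant analysis suggestion concrete, here is a small sketch of the two-group case (Fisher's linear discriminant) with only two variables, so the matrix algebra can be written out by hand. The direction is w = S_w^{-1} (m_a - m_b), where S_w is the pooled within-group scatter; multiplying by S_w^{-1} is the "whiten intra class scatter" step mentioned elsewhere in the thread. With c = 2 groups there is c - 1 = 1 discriminant. The data are invented: neither variable separates the groups on its own, but the combination does.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def within_scatter(groups):
    """Pooled within-group scatter matrix S_w for 2-D data (2x2 list)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Discriminant direction w = S_w^{-1} (m_a - m_b) for 2-D data."""
    s = within_scatter([class_a, class_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Explicit inverse of a 2x2 matrix applied to the mean difference.
    return [
        (s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
        (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det,
    ]


def project(w, row):
    """Score of one observation on the discriminant direction."""
    return w[0] * row[0] + w[1] * row[1]


# Hypothetical two-variable measurements for two genres.
classical = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2)]
rock = [(1.0, 0.1), (2.0, 0.9), (3.0, 2.2), (4.0, 2.8)]

w = fisher_direction(classical, rock)
classical_scores = [project(w, r) for r in classical]
rock_scores = [project(w, r) for r in rock]
```

With several hundred variables and only a modest number of files, S_w will usually be singular, so in practice some reduction (for example principal component scores) would be needed before this step.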
Re: Statistics Tool For Classification/Clustering
"Rishabh Gupta" <[EMAIL PROTECTED]> wrote in message news:a4eje9$ip8$[EMAIL PROTECTED]...
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification.
>
> [remainder of quoted message snipped]
>
> Any help or suggestion that can be offered will be greatly appreciated.

A useful exposition of techniques for the initial investigation of a multivariate data set is given at
http://www.sas.com/service/library/periodicals/obs/obswww22/

If you point your browser at "Andrews plots" you will find more.

My inclination would be to start with an Andrews plot, possibly using principal component scores for about 20 music files from several genres. This will enable you to find linear combinations of variables which best separate the genres. The technique and examples are set out in Gnanadesikan: Multivariate Data Analysis, but this is an old reference.

I hope this helps

Jim Snow
Re: Statistics Tool For Classification/Clustering
Classification is a specialized field. Go to http://www.pitt.edu/~csna/ and click on CLASS-L. Although this is the Classification Society of North America, members of the British Classification Society also follow it.

SPSS should be able to handle what you want to do. However, you need face-to-face consulting/collaboration with someone who does this kind of analysis. Many of the techniques grew out of psychology, so if CLASS-L doesn't help you might try your local psychology departments.

Rishabh Gupta wrote:
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working on a project related to music analysis and
> classification.
>
> [remainder of quoted message snipped]
Re: Statistics Tool For Classification/Clustering
In sci.stat.math Rishabh Gupta <[EMAIL PROTECTED]> wrote:
[ snip ]

It seems that you are new to the field of pattern recognition. In that case, you may want to check out the classic book "Pattern Classification" by Duda, Hart and Stork. There is a second edition that came out in 2001. It is a classic of the field, and you may find other insights useful to your problem.

M Law
Re: Statistics Tool For Classification/Clustering
"Rishabh Gupta" <[EMAIL PROTECTED]> wrote in news:a4eje9$ip8$[EMAIL PROTECTED]:
> Hi All,
> I'm a research student at the Department Of Electronics, University
> Of York, UK. I'm working on a project related to music analysis and
> classification.
>
> [remainder of quoted message snipped]

In SPSS, Factor Analysis would help you reduce your many variables down to bigger, more general ones. As well, Cluster Analysis will let you see how your variables group themselves. The results might look like the following:

Factor 1: (percussiveness)
    volume of drums
    number of drum types
    drum melodies...

Factor 2: (happiness)
    minor modes
    speed
    pitch...

Factor 3: (memorableness)
    melodic structure
    folk music precursor

The cluster analysis would be similar, but would have the variables on a branching tree that showed that speed and pitch were closer than drum type and folk precursor, say.

It would be interesting to see how this works. I wonder if you could calculate some kind of fractal dimension for the music too?

Doug H
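For readers without SPSS, the idea of letting the variables "group themselves" can be tried with a few lines of code. The sketch below is not what SPSS does internally; it is a bare-bones single-linkage clustering of variables using 1 - |r| as the distance, stopping when the closest remaining groups are further apart than a cutoff. The variable names, values and the 0.5 cutoff are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cluster_variables(columns, cutoff=0.5):
    """Single-linkage grouping of variables with distance 1 - |r|."""
    names = list(columns)
    dist = {
        (a, b): 1 - abs(pearson(columns[a], columns[b]))
        for a in names for b in names if a != b
    }
    clusters = [[n] for n in names]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Hypothetical variables measured on six music files.
variables = {
    "tempo":       [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "note_rate":   [2.0, 4.0, 5.0, 8.0, 10.0, 13.0],
    "drum_volume": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "drum_types":  [10.0, 3.0, 8.0, 5.0, 13.0, 6.0],
}
clusters = cluster_variables(variables)
```

Recording the distance at which each merge happens, instead of stopping at a cutoff, would give the branching tree described above.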