First, I apologize in advance if my questions are very basic. I am very new
to coding in general, as well as to Principal Component Analysis and
scikit-learn. I am trying to finish a project for an internship, have hit a
wall, and am hoping to find help solving this before my deadline.


I have a set of RNA sequences that are evaluated on several parameters.
For the sake of simplicity, let's say there are three: the GC content of the
RNA sequence's ribosome binding site (RBS), the estimated stability of the
RNA sequence's secondary structure (MFE), and the ensemble defect of the RNA
(i.e. the number of nucleotides that do not conform to a prescribed secondary
structure). A series of functions calculates a value for each RNA sequence on
each parameter; the lower the value, the better suited the sequence is to our
experimental purposes.


What I am now trying to do is build a composite score for each RNA sequence
from its parameter values using PCA. Rather than use PCA for dimension
reduction, I want to use it to determine the loading of each parameter on its
respective component. I have been using the following code to calculate the
loadings, which I then use to create a composite score: the sum of the
original values multiplied by their corresponding loadings (i.e. the loadings
are used as weights on the original values).


from os import path

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# output_folder is defined earlier in my script
green = path.join(output_folder, "SequenceScoring.csv")
df = pd.read_csv(green)
X = df.values
X = scale(X)  # standardize each column to zero mean and unit variance
pca = PCA(n_components=3)
X1 = pca.fit_transform(X)  # fits the PCA and transforms X in one call
df1 = pd.DataFrame(
    data=X1,
    index=range(len(df['MFE of Sequence Complex'])),
    columns=[
        'Loading MFE of Sequence Complex',
        'Loading Percentage of Ensemble Defect of RBS of Trigger-Switch Complex',
        'Loading GC Content of RBS Region in Switch',
    ],
)
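
For reference, the weighting scheme described above (original values
multiplied by loadings and summed) could be sketched as follows. This is only
an illustration under stated assumptions: the data are random stand-ins rather
than the real SequenceScoring.csv, the names X_demo, pca_demo, loadings,
weights and composite are hypothetical, "loadings" is taken to mean
eigenvectors scaled by the square root of their eigenvalues, and only the
first component is used for the weights.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical stand-in for the three parameter columns (not real data).
rng = np.random.default_rng(0)
X_demo = scale(rng.normal(size=(50, 3)))

pca_demo = PCA(n_components=3).fit(X_demo)

# Rows of components_ are unit-length eigenvectors, one row per component.
eigvecs = pca_demo.components_

# Loadings as scaled eigenvectors: shape (n_features, n_components).
loadings = eigvecs.T * np.sqrt(pca_demo.explained_variance_)

# Assumption: weight each parameter by its loading on the first component.
weights = loadings[:, 0]

# One composite value per sequence: weighted sum of the standardized values.
composite = X_demo @ weights
```

Under these assumptions the composite is just the first component's score
rescaled by a constant. The sign of an eigenvector is mathematically
arbitrary, so whether a lower composite still means "better" would need to be
checked against the signs of the weights.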


Although this seems to be the correct procedure, I am not certain that I
properly understand the output of X1=pca.fit_transform(X). One source I
initially used seemed to clear the matter up ("Sklearn PCA is pca.components_
the loadings?",
https://stackoverflow.com/questions/36380183/sklearn-pca-is-pca-components-the-loadings),
but upon closer inspection I realized I wasn't sure I was getting the correct
values from this output, which was described as "the result of the projection
for each sample into the vector space spanned by the components". Furthermore,
loadings can also be defined such that "sums of squares within each component
are the eigenvalues (components' variances)" (source:
https://stats.stackexchange.com/questions/92499/how-to-interpret-pca-loadings).
I checked the eigenvalues of my parameters using:


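import numpy as np
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
cov_mat = np.cov(X_std.T)  # covariance matrix of the standardized columns

eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eig_vals)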

I then squared and summed the values in each column produced by
`X1=pca.fit_transform(X)`, and found that the sums did not match the
eigenvalues for the respective parameters at all.
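
The comparison described here can be reproduced on synthetic data. The sketch
below is a hedged illustration, not the original analysis: the random data and
the names X_demo, pca_demo, scores, score_var, loadings and loading_ss are
hypothetical. It relies on two things scikit-learn documents: fit_transform
returns the projected samples (scores), and components_ holds the unit-length
principal axes. The "scaled eigenvector" definition of loadings is taken from
the stats.stackexchange answer quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 50 samples, 3 parameters.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(50, 3)))
n_samples = X_demo.shape[0]

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)  # shape (n_samples, n_components)

# Eigenvalues of the covariance matrix, sorted largest first like PCA does.
eig_vals = np.sort(np.linalg.eigvalsh(np.cov(X_demo.T)))[::-1]

# Scores: the sum of squares per column must be divided by n - 1
# before it equals the eigenvalue (it is a variance, not a raw sum).
score_var = (scores ** 2).sum(axis=0) / (n_samples - 1)

# Scaled eigenvectors: the column sums of squares are the eigenvalues directly.
loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)
loading_ss = (loadings ** 2).sum(axis=0)

print(eig_vals)
print(pca_demo.explained_variance_)
print(score_var)
print(loading_ss)
```

All four printed arrays should agree up to floating-point error, which
suggests that a raw sum of squares over the fit_transform output would be
larger than the eigenvalues by a factor of n - 1.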

It is worth noting that I understood the term "loading values" as the distance 
between a certain value and an associated component (so that values that 
influenced the slope and variance captured by the component more strongly had 
higher loading values). Am I fundamentally misunderstanding the concept of 
loading values? Or am I not using the right function from scikit-learn? I have 
tried to look through the source code for scikit-learn's pca.fit_transform, but 
I don't have the level of mathematical or coding experience required to 
understand it.
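
One common reading of loadings on standardized data, offered here only as a
hedged sketch, is the correlation between each original variable and each
component's scores rather than a distance. The data and all names below
(raw, X_demo, pca_demo, corr, loadings) are hypothetical, and the (n - 1) / n
factor is an adjustment for the fact that StandardScaler divides by n while
explained_variance_ uses n - 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two correlated parameters.
rng = np.random.default_rng(1)
raw = rng.normal(size=(40, 3))
raw[:, 1] += 0.8 * raw[:, 0]
X_demo = StandardScaler().fit_transform(raw)
n = X_demo.shape[0]

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)

# Correlation of every original variable (rows) with every component (columns).
corr = np.array([
    [np.corrcoef(X_demo[:, j], scores[:, k])[0, 1] for k in range(3)]
    for j in range(3)
])

# Eigenvectors scaled by the square root of the eigenvalues, with the
# eigenvalues rescaled from the n - 1 denominator to the n denominator.
loadings = pca_demo.components_.T * np.sqrt(
    pca_demo.explained_variance_ * (n - 1) / n
)

print(np.allclose(corr, loadings))
```

In this reading a loading close to 1 or -1 means the variable moves almost in
lockstep with the component, and a loading near 0 means it contributes little.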


Thanks so much,

David
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn