Hello Herb. Thank you very much for your reply. I want the cosine similarity for each a and each b. I'm using Lucene code I found online, which I will post below.
Hello Uwe. Thank you very much for replying. I am using a class DocVector and a second class in which I try to compute the similarities between documents that were indexed in two folders. Here is the code for the two classes. Could you please help me? What am I doing wrong? Thank you very much!

package NewApp;

import extractout.*;
import java.util.Map;
import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealVectorFormat;
import org.apache.commons.math3.linear.SparseRealVector;

/**
 *
 * @author Stefy
 */
class DocVector {

    public Map<String,Integer> terms;
    public SparseRealVector vector;

    public DocVector(Map<String,Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (SparseRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}

---------------------------------------------------------------------------------------

public class testCosine {

    static String in_B = "/local/march_exp/in_B";
    static String data_B = "/local/march_exp/B_split100_EN";
    static String in_A = "/local/march_exp/in_A";
    static String data_A = "/local/march_exp/A_split100_EN";
    static File indexDir_B, dataDir_B, indexDir_A, dataDir_A;
    static IndexReader reader_A, reader_B;
    static Directory dir_B, dir_A;
    static int size_B = 23992, size_A = 10995;

    private static double getCosineSimilarity(DocVector d1, DocVector d2) {
        return (d1.vector.dotProduct(d2.vector)) / (d1.vector.getNorm() * d2.vector.getNorm());
    }

    public static void testSimilarityUsingCosine() throws Exception {
        indexDir_A = new File(in_A);
        dir_A = FSDirectory.open(indexDir_A);
        reader_A = IndexReader.open(dir_A);
        indexDir_B = new File(in_B);
        dir_B = FSDirectory.open(indexDir_B);
        reader_B = IndexReader.open(dir_B);

        Map<String, Integer> terms_A = new HashMap<String, Integer>();
        TermEnum termEnum_A = reader_A.terms(new Term("contents"));
        Map<String, Integer> terms_B = new HashMap<String, Integer>();
        TermEnum termEnum_B = reader_B.terms(new Term("contents"));

        int pos = 0;
        while (termEnum_A.next()) {
            Term term = termEnum_A.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_A.put(term.text(), pos++);
        }

        pos = 0;
        while (termEnum_B.next()) {
            Term term = termEnum_B.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_B.put(term.text(), pos++);
        }

        int[] docIds_A = new int[size_A];
        DocVector[] docs_A = new DocVector[docIds_A.length];
        int i = 0;
        for (int docId : docIds_A) {
            TermFreqVector[] tfvs = reader_A.getTermFreqVectors(docId);
            docs_A[i] = new DocVector(terms_A);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_A[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_A[i].normalize();
            i++;
        }

        int[] docIds_B = new int[size_B];
        DocVector[] docs_B = new DocVector[docIds_B.length];
        i = 0;
        for (int docId : docIds_B) {
            TermFreqVector[] tfvs = reader_B.getTermFreqVectors(docId);
            docs_B[i] = new DocVector(terms_B);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_B[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_B[i].normalize();
        }

        FileWriter fstream_c = new FileWriter("/local/march_exp/COS/COSINE_.txt");
        BufferedWriter writer_c = new BufferedWriter(fstream_c);
        double[][] cosimvect = new double[size_A][size_B];
        for (i = 0; i < size_A; i++) {
            for (int j = 0; j < size_B; j++) {
                cosimvect[i][j] = getCosineSimilarity(docs_A[i], docs_B[j]);
                System.out.println("cosine between " + i + " " + j + " is " + cosimvect[i][j]);
            }
        }

        writer_c.close();
        reader_B.close();
        reader_A.close();
        dir_B.close();
        dir_A.close();
    }

    public static void main(String[] args) throws Exception {
        testSimilarityUsingCosine();
    }
}

On Friday, March 21, 2014 12:14 AM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi Stefy,

the stack trace you posted has nothing to do with Apache Lucene. It looks like you are using some commons-lang3 classes here, but no Lucene code at all. So I think your question might be better asked on the commons-math mailing list, unless you have some Lucene code around, too. If this is the case, you should give more information how you use Lucene.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Stefy D. [mailto:tsuki_st...@yahoo.com]
> Sent: Thursday, March 20, 2014 10:05 PM
> To: java-user@lucene.apache.org
> Subject: Dimension mismatch exception
>
> Dear all,
>
> I am trying to compute the cosine similarity between several documents. I
> have an indexed directory A made using 10000 files and another indexed
> directory B made using 20000 files. All the indexed documents from both
> directories have the same length (100 sentences). I want to get the cosine
> similarity between documents from directory A and documents from
> directory B. I have used the code from here but on the two indexed
> directories. So I use something like getCosineSimilarity(docs_A[i],
> docs_B[j]);
>
> I get the following error:
> Exception in thread "main"
> org.apache.commons.math3.exception.DimensionMismatchException: 44,375 != 596,263
>     at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:179)
>     at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:165)
>     at org.apache.commons.math3.linear.RealVector.dotProduct(RealVector.java:307)
>     at NewApp.testCosine.getCosineSimilarity(testCosine.java:57)
>
> Please help me. Thank you very much!
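
A note on the exception: the two numbers in the DimensionMismatchException (44,375 != 596,263) look like the sizes of the two term maps, terms_A and terms_B. Each DocVector is created with new OpenMapRealVector(terms.size()), so documents from index A and documents from index B probably end up with different dimensions, and dotProduct rejects the pair. One possible fix is to build a single term-to-position map over both indexes and size every vector from it. The sketch below shows the idea in plain Java, without Lucene or commons-math; the class SharedVocabularyCosine and all of its methods are hypothetical names made up for illustration, and the dense double[] vectors are only for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene or commons-math.
final class SharedVocabularyCosine {

    // Assign one position per distinct term across BOTH term lists, so that
    // every document vector built from this map has the same dimension.
    static Map<String, Integer> buildSharedTerms(Iterable<String> termsA, Iterable<String> termsB) {
        TreeSet<String> all = new TreeSet<>();
        for (String t : termsA) {
            all.add(t);
        }
        for (String t : termsB) {
            all.add(t);
        }
        Map<String, Integer> positions = new HashMap<>();
        int pos = 0;
        for (String t : all) {
            positions.put(t, pos++);
        }
        return positions;
    }

    // Turn a term -> frequency map into a vector over the shared vocabulary.
    // Terms missing from the shared map are skipped.
    static double[] toVector(Map<String, Integer> termFreqs, Map<String, Integer> positions) {
        double[] v = new double[positions.size()];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer pos = positions.get(e.getKey());
            if (pos != null) {
                v[pos] = e.getValue();
            }
        }
        return v;
    }

    // Cosine similarity; returns 0.0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(a.length + " != " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy stand-ins for the terms of index A and index B.
        Map<String, Integer> positions = buildSharedTerms(
                Arrays.asList("apple", "banana"),
                Arrays.asList("banana", "cherry"));

        Map<String, Integer> docA = new HashMap<>();
        docA.put("apple", 1);
        docA.put("banana", 1);
        Map<String, Integer> docB = new HashMap<>();
        docB.put("banana", 1);
        docB.put("cherry", 1);

        double sim = cosine(toVector(docA, positions), toVector(docB, positions));
        System.out.println("shared dimension = " + positions.size() + ", cosine = " + sim);
    }
}
```

In the posted code, the same idea would mean passing one shared map to every DocVector, so that each OpenMapRealVector is created with the same dimension. Two other things in the posted code may be worth checking: docIds_A and docIds_B are never filled, so every lookup uses document 0, and the loop over docIds_B never increments i, so only docs_B[0] is ever assigned.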