Hello Herb. Thank you very much for your reply. I want to have the cosine for 
each a and each b. I'm using code for lucene I found online, which I will post 
below.

Hello Uwe. Thank you very much for replying. I am using a class DocVector and 
then a class in which i try to compute the similarities from documents that 
were indexed in two folders. Here is the code for the two classes. 

Could you please help me? What am I doing wrong? Thank you very much!

package NewApp;

import extractout.*;
import java.util.Map;
import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealVectorFormat;
import org.apache.commons.math3.linear.SparseRealVector;

/**
 *
 * @author Stefy
 */
class DocVector {
    
     public Map<String,Integer> terms;
      public SparseRealVector vector;
      
      public DocVector(Map<String,Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());
      }
      
      public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
          int pos = terms.get(term);
          vector.setEntry(pos, (double) freq);
        }
      }
      
      public void normalize() {
        double sum = vector.getL1Norm();
        vector = (SparseRealVector) vector.mapDivide(sum);
      }
      
    @Override
      public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
      }
}

---------------------------------------------------------------------------------------
public class testCosine {

    static String in_B = "/local/march_exp/in_B";
    static String data_B = "/local/march_exp/B_split100_EN";
    static String in_A = "/local/march_exp/in_A";
    static String data_A = "/local/march_exp/A_split100_EN";
    static File indexDir_B, dataDir_B, indexDir_A, dataDir_A;
    static IndexReader reader_A, reader_B;
    static Directory dir_B, dir_A;
    static int size_B = 23992, size_A = 10995;

    private static double getCosineSimilarity(DocVector d1, DocVector d2) {
        return (d1.vector.dotProduct(d2.vector))
                / (d1.vector.getNorm() * d2.vector.getNorm());
    }

    public static void testSimilarityUsingCosine() throws Exception {

        indexDir_A = new File(in_A);
        dir_A = FSDirectory.open(indexDir_A);
        reader_A = IndexReader.open(dir_A);

        indexDir_B = new File(in_B);
        dir_B = FSDirectory.open(indexDir_B);
        reader_B = IndexReader.open(dir_B);

        Map<String, Integer> terms_A = new HashMap<String, Integer>();
        TermEnum termEnum_A = reader_A.terms(new Term("contents"));
        Map<String, Integer> terms_B = new HashMap<String, Integer>();
        TermEnum termEnum_B = reader_B.terms(new Term("contents"));

        int pos = 0;
        while (termEnum_A.next()) {
            Term term = termEnum_A.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_A.put(term.text(), pos++);
        }

        pos = 0;
        while (termEnum_B.next()) {
            Term term = termEnum_B.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_B.put(term.text(), pos++);
        }


        int[] docIds_A = new int[size_A];
        DocVector[] docs_A = new DocVector[docIds_A.length];
        int i = 0;
        for (int docId : docIds_A) {
            TermFreqVector[] tfvs = reader_A.getTermFreqVectors(docId);
            docs_A[i] = new DocVector(terms_A);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_A[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_A[i].normalize();
            i++;
        }

        int[] docIds_B = new int[size_B];
        DocVector[] docs_B = new DocVector[docIds_B.length];
        i = 0;
        for (int docId : docIds_B) {
            TermFreqVector[] tfvs = reader_B.getTermFreqVectors(docId);
            docs_B[i] = new DocVector(terms_B);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_B[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_B[i].normalize();
        }

        FileWriter fstream_c = new 
FileWriter("/local/march_exp/COS/COSINE_.txt");
        BufferedWriter writer_c = new BufferedWriter(fstream_c);

        double[][] cosimvect = new double[size_A][size_B];
        for (i = 0; i < size_A; i++) {
            for (int j = 0; j < size_B; j++) {
                cosimvect[i][j] = getCosineSimilarity(docs_A[i], docs_B[j]);
                System.out.println("cosine between " + i + " " + j + " is " + 
cosimvect[i][j]);
            }
        }
        writer_c.close();
        reader_B.close();
        reader_A.close();
        dir_B.close();
        dir_A.close();
    }

    public static void main(String[] args) throws Exception {

        testSimilarityUsingCosine();
    }
}
 





On Friday, March 21, 2014 12:14 AM, Uwe Schindler <u...@thetaphi.de> wrote:
 
Hi Stefy,

the stack trace you posted has nothing to do with Apache Lucene. It looks like 
you are using some commons-lang3 classes here, but no Lucene code at all. So I 
think your question might be better asked on the commons-math mailing list, 
unless you have some Lucene code around, too. If this is the case, you should 
give more information how you use Lucene.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



> -----Original Message-----
> From: Stefy D. [mailto:tsuki_st...@yahoo.com]
> Sent: Thursday, March 20, 2014 10:05 PM
> To: java-user@lucene.apache.org
> Subject: Dimension mismatch exception
> 
> Dear all,
> 
> I am trying to compute the cosine similarity between several documents. I
> have an indexed directory A made using 10000 files and another indexed
> directory B made using 20000 files. All the indexed documents from both
> directories have the same length (100 sentences). I want to get the cosine
> similarity between documents from directory A and documents from
> directory B. I have used the code from here but on the two indexed
> directories. So I use something like getCosineSimilarity(docs_A[i], 
> docs_B[j]);
> 
> I get the following error:
> Exception in thread "main"
> org.apache.commons.math3.exception.DimensionMismatchException:
> 44,375 != 596,263
>     at
> org.apache.commons.math3.linear.RealVector.checkVectorDimensions(Real
> Vector.java:179)
>     at
> org.apache.commons.math3.linear.RealVector.checkVectorDimensions(Real
> Vector.java:165)
>     at
> org.apache.commons.math3.linear.RealVector.dotProduct(RealVector.java:3
> 07)
>     at NewApp.testCosine.getCosineSimilarity(testCosine.java:57)
> 
> Please help me. Thank you very much!


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to