Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.

Ankit Murarka Thu, 29 Aug 2013 05:07:59 -0700

Yes, I know that Lucene should not have any document size limits. All I get is a lock file inside my index folder; there is no other file in the index folder. Then I get an OOM exception.

Please provide some guidance...

Here is the example:


package com.issue;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LiveIndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.index.MergePolicy.OneMerge;
import org.apache.lucene.index.MergeScheduler;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;


import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.util.Date;

public class D {

  /** Index all text files under a directory. */


    static String[] filenames;

  public static void main(String[] args) {

    //String indexPath = args[0];

    String indexPath="D:\\Issue";//Place where indexes will be created
    String docsPath="Issue";    //Place where the files are kept.
    boolean create=true;

    String ch="OverAll";


   final File docDir = new File(docsPath);
   if (!docDir.exists() || !docDir.canRead()) {

      System.out.println("Document directory '" + docDir.getAbsolutePath() + "' does not exist or is not readable, please check the path");

      System.exit(1);
    }

    Date start = new Date();
   try {
     Directory dir = FSDirectory.open(new File(indexPath));

      Analyzer analyzer = new com.rancore.demo.CustomAnalyzerForCaseSensitive(Version.LUCENE_44);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);

      iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

      IndexWriter writer = new IndexWriter(dir, iwc);
      if(ch.equalsIgnoreCase("OverAll")){
          indexDocs(writer, docDir,true);
      }else{
          filenames=args[2].split(",");
         // indexDocs(writer, docDir);

   }
      writer.commit();
      writer.close();

    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
       "\n with message: " + e.getMessage());
    }
    catch(Exception e)
    {

        e.printStackTrace();
    }
 }

  //Over All
  static void indexDocs(IndexWriter writer, File file,boolean flag)
  throws IOException {

      FileInputStream fis = null;
 if (file.canRead()) {

    if (file.isDirectory()) {
     String[] files = file.list();
      // an IO error could occur
      if (files != null) {
        for (int i = 0; i < files.length; i++) {
          indexDocs(writer, new File(file, files[i]),flag);
        }
      }
   } else {
      try {
        fis = new FileInputStream(file);
     } catch (FileNotFoundException fnfe) {

       fnfe.printStackTrace();
     }

      try {

          Document doc = new Document();

Field pathField = new StringField("path", file.getPath(),Field.Store.YES);

          doc.add(pathField);

doc.add(new LongField("modified", file.lastModified(),Field.Store.NO));


          doc.add(new StringField("name",file.getName(),Field.Store.YES));

          doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));


          LineNumberReader lnr=new LineNumberReader(new FileReader(file));


         String line=null;
          while( null != (line = lnr.readLine()) ){
              doc.add(new StringField("SC",line.trim(),Field.Store.YES));

              // doc.add(new Field("contents",line,Field.Store.YES,Field.Index.ANALYZED));

              if (writer.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND) {


            writer.addDocument(doc);
            writer.commit();
            fis.close();
          } else {
              try
              {
            writer.updateDocument(new Term("path", file.getPath()), doc);

            fis.close();

              }catch(Exception e)
              {
                  writer.close();
                   fis.close();

                  e.printStackTrace();

              }
          }

      }catch (Exception e) {
           writer.close();
            fis.close();

         e.printStackTrace();
      }finally {
          // writer.close();

        fis.close();
      }
    }
  }
}
}



On 8/29/2013 4:20 PM, Michael McCandless wrote:

Lucene doesn't have document size limits.

There are default limits for how many tokens the highlighters will process ...

But, if you are passing each line as a separate document to Lucene,
then Lucene only sees a bunch of tiny documents, right?
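
[Editor's sketch] The "one tiny document per line" pattern described above could look roughly like the following. This is a minimal sketch, not the poster's code: it uses only the JDK so that it runs on its own, and LineSink and PerLineIndexingSketch are hypothetical names introduced for illustration. In a real indexer the sink body would presumably build a fresh Document for each line and pass it to IndexWriter.addDocument, so that nothing accumulates from one line to the next.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical stand-in for "build one small document from this line and index it". */
interface LineSink {
    void accept(String line) throws IOException;
}

class PerLineIndexingSketch {

    /**
     * Streams the file and hands each trimmed line to the sink, one at a time.
     * Only the current line is held in memory; returns the number of lines handed over.
     */
    static long indexLines(Path file, LineSink sink) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: real code would create a NEW document here for every line.
                sink.accept(line.trim());
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PerLineIndexingSketch <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        long n = indexLines(file, line -> { /* index the line here */ });
        System.out.println("lines handed to the sink: " + n);
    }
}
```

The point of the sketch is only that each unit of work is independent and small; committing once after the loop, rather than per line, is a separate choice left to the caller.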

Can you boil this down to a small test showing the problem?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 29, 2013 at 1:51 AM, Ankit Murarka
<ankit.mura...@rancoretech.com>  wrote:

Hello all,

I am facing a typical issue.
I have many files which I am indexing.

Problem faced:
a. Files smaller than 20 MB are successfully indexed and merged.

b. Files larger than 20 MB are not getting indexed. No exception is being
thrown. Only a lock file is created in the index directory. The
indexing process for a single file exceeding 20 MB continues for more
than 8 minutes, after which I have code that merges the generated index into
the existing index.

Since no index is being generated, I get an exception during the merging
process.

Why are files larger than 20 MB not being indexed? I am
indexing each line of the file. Why is IndexWriter not throwing any error?

Do I need to change any parameter in Lucene or tweak the Lucene settings?
Lucene version is 4.4.0.

My current deployment for Lucene is on a server running with 128 MB and 512
MB heap.
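
[Editor's sketch] Since the heap size matters when an OOM is involved, a quick way to confirm what the JVM actually has available is to ask the runtime from inside the indexing process. This is a minimal sketch using only standard JDK calls; the class name HeapCheckSketch is hypothetical and the output format is arbitrary.

```java
class HeapCheckSketch {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Logging these two values before and after each file is indexed would show whether memory use grows with file size.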

--
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with
what lies within us"


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what 
lies within us"

