Also, after changing some settings on IndexWriterConfig and LiveIndexWriterConfig, I now get the following exception:
20:31:23,540 INFO java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.UnicodeUtil.UTF16toUTF8WithHash(UnicodeUtil.java:136)
    at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.fillBytesRef(CharTermAttributeImpl.java:91)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:165)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:245)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:265)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:432)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1513)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1169)
    at com.rancore.MainClass1.indexDocs(MainClass1.java:220)
    at com.rancore.MainClass1.indexDocs(MainClass1.java:167)
    at com.rancore.MainClass1.main(MainClass1.java:110)

java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2726)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2897)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2872)
    at com.rancore.MainClass1.main(MainClass1.java:136)

Can anyone please guide? There has to be some way a file of, say, 20 MB can be properly indexed. Any guidance is highly appreciated.

On 8/30/2013 6:49 PM, Ankit Murarka wrote:
Hello,

The following exception is being printed on the server console when trying to index. As usual, the indexes are not getting created.

java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.AttributeSource.<init>(AttributeSource.java:148)
    at org.apache.lucene.util.AttributeSource.<init>(AttributeSource.java:128)
    at org.apache.lucene.analysis.TokenStream.<init>(TokenStream.java:91)
    at org.apache.lucene.document.Field$StringTokenStream.<init>(Field.java:568)
    at org.apache.lucene.document.Field.tokenStream(Field.java:541)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:95)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:245)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:265)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:432)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1513)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1169)
    at com.rancore.MainClass1.indexDocs(MainClass1.java:197)
    at com.rancore.MainClass1.indexDocs(MainClass1.java:153)
    at com.rancore.MainClass1.main(MainClass1.java:95)

java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2726)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2897)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2872)
    at com.rancore.MainClass1.main(MainClass1.java:122)
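[Editorial aside: both traces show the heap filling up while document inversion copies term text, i.e. while whole-file field values are being tokenized. The following is a minimal, Lucene-free sketch of why holding a 20 MB file as one in-memory String is so expensive; the 20 MB figure and the mostly-ASCII assumption are illustrative, not taken from the thread.]

```java
import java.nio.charset.StandardCharsets;

public class HeapCostSketch {
    public static void main(String[] args) {
        // A Java String stores text as UTF-16: 2 bytes per ASCII character.
        String line = "the quick brown fox jumps over the lazy dog";
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);

        // For ASCII text the in-heap char[] is twice the UTF-8 size.
        System.out.println("UTF-16 chars: " + line.length() * 2 + " bytes");
        System.out.println("UTF-8 bytes:  " + utf8.length + " bytes");

        // So a 20 MB ASCII file read into one String needs ~40 MB of char
        // data alone, before the indexer buffers its own copy of each term.
        long fileBytes = 20L * 1024 * 1024;
        long utf16Bytes = fileBytes * 2; // assuming mostly ASCII content
        System.out.println("20 MB file as one String: ~"
            + utf16Bytes / (1024 * 1024) + " MB of char data");
    }
}
```

The per-term buffers and the extra stored copies of every line multiply that baseline further, which is why the dump lands in UTF-16-to-UTF-8 conversion.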
Indexing to directory

Any guidance will be highly appreciated! Server opts are: -server -Xms8192m -Xmx16384m -XX:MaxPermSize=512m

On 8/30/2013 3:13 PM, Ankit Murarka wrote:

Hello. The server has much more memory; I have given a minimum of 8 GB to the application server. The Java opts of interest are: -server -Xms8192m -Xmx16384m -XX:MaxPermSize=8192m

Even after giving the server this much memory, how am I hitting OOM exceptions? No other activity is being performed on the server apart from this. Checking from JConsole, the maximum heap during indexing was close to 1.2 GB, whereas the memory allocated is as mentioned above. I did mention 128 MB earlier, but that is when I start the server on a normal Windows machine.

Isn't there any property/configuration in Lucene which I should set in order to index large files, say about 30 MB? I read something about MergeFactor etc. but was not able to set any value for it, and I don't even know whether doing that will help.

On 8/29/2013 7:04 PM, Ian Lea wrote:

Well, I use neither Eclipse nor your application server and can offer no advice on any differences in behaviour between the two. Maybe you should try Eclipse or app server forums. If you are going to index the complete contents of a file as one field you are likely to hit OOM exceptions. How big is the largest file you are ever going to index? The server may have 8GB, but how much memory are you allowing the JVM? What are the command line flags? I think you mentioned 128Mb in an earlier email. That isn't much.

--
Ian.

On Thu, Aug 29, 2013 at 2:14 PM, Ankit Murarka <ankit.mura...@rancoretech.com> wrote:

Hello, I get the exception only when the code is fired from Eclipse. When it is deployed on an application server, I get no exception at all. This forced me to invoke the same code from Eclipse and check what the issue is. I ran the code on a server with 8 GB memory; even then no exception occurred!
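[Editorial aside: Ian's warning about indexing a whole file as one field applies to any code path that materializes the file as a single String. Reading through a BufferedReader instead keeps heap use proportional to the longest line, not the file size. A minimal stdlib-only sketch; the temp file and its contents are made up, and the loop body stands in for whatever per-line work (e.g. adding to an index) the real code does.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamingReadSketch {

    // Process a file line by line; only one line is ever held in memory.
    static long countNonBlankLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++; // handle (e.g. index) one line here
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".log");
        Files.write(tmp, "one\n\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countNonBlankLines(tmp)); // prints 3
        Files.delete(tmp);
    }
}
```

The same shape works for a 20 MB or 200 MB file without growing the heap, which is the property the thread is after.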
Only write.lock is formed. Removing the contents field is not desirable, as it is needed for search to work correctly.

On 8/29/2013 6:17 PM, Ian Lea wrote:

So you do get an exception after all: OOM. Try it without this line:

    doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

I think that will slurp the whole file in one go, which will obviously need more memory on larger files than on smaller ones. Or just run the program with more memory.

--
Ian.

On Thu, Aug 29, 2013 at 1:05 PM, Ankit Murarka <ankit.mura...@rancoretech.com> wrote:

Yes, I know that Lucene should not have any document size limits. All I get is a lock file inside my index folder; along with this there is no other file inside the index folder. Then I get an OOM exception. Please provide some guidance. Here is the example:

package com.issue;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.util.Date;

public class D {

    /** Index all text files under a directory. */
    static String[] filenames;

    public static void main(String[] args) {
        // String indexPath = args[0];
        String indexPath = "D:\\Issue"; // place where indexes will be created
        String docsPath = "Issue";      // place where the files are kept
        String ch = "OverAll";

        final File docDir = new File(docsPath);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();
        try {
            Directory dir = FSDirectory.open(new File(indexPath));
            Analyzer analyzer = new com.rancore.demo.CustomAnalyzerForCaseSensitive(Version.LUCENE_44);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            IndexWriter writer = new IndexWriter(dir, iwc);

            if (ch.equalsIgnoreCase("OverAll")) {
                indexDocs(writer, docDir, true);
            } else {
                filenames = args[2].split(",");
                // indexDocs(writer, docDir);
            }
            writer.commit();
            writer.close();
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                + "\n with message: " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Over All
    static void indexDocs(IndexWriter writer, File file, boolean flag) throws IOException {
        if (!file.canRead()) {
            return;
        }
        if (file.isDirectory()) {
            String[] files = file.list(); // an IO error could occur
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]), flag);
                }
            }
        } else {
            FileInputStream fis = null;
            try {
                fis = new FileInputStream(file);

                Document doc = new Document();
                doc.add(new StringField("path", file.getPath(), Field.Store.YES));
                doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
                doc.add(new StringField("name", file.getName(), Field.Store.YES));
                doc.add(new TextField("contents", new BufferedReader(
                    new InputStreamReader(fis, "UTF-8"))));

                LineNumberReader lnr = new LineNumberReader(new FileReader(file));
                String line = null;
                while (null != (line = lnr.readLine())) {
                    doc.add(new StringField("SC", line.trim(), Field.Store.YES));
                    // doc.add(new Field("contents", line, Field.Store.YES, Field.Index.ANALYZED));
                }

                if (writer.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND) {
                    writer.addDocument(doc);
                    writer.commit();
                } else {
                    writer.updateDocument(new Term("path", file.getPath()), doc);
                }
            } catch (FileNotFoundException fnfe) {
                fnfe.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (fis != null) {
                    fis.close();
                }
            }
        }
    }
}

On 8/29/2013 4:20 PM, Michael McCandless wrote:

Lucene doesn't have document size limits. There are default limits for how many tokens the highlighters will process. But, if you are passing each line as a separate document to Lucene, then Lucene only sees a bunch of tiny documents, right? Can you boil this down to a small test showing the problem?

Mike McCandless
http://blog.mikemccandless.com

On Thu, Aug 29, 2013 at 1:51 AM, Ankit Murarka <ankit.mura...@rancoretech.com> wrote:

Hello all,

Faced with a typical issue. I have many files which I am indexing.

Problem faced:

a. Files having size less than 20 MB are successfully indexed and merged.
b. Files having size greater than 20 MB are not getting indexed. No exception is being thrown; only a lock file is being created in the index directory.

The indexing process for a single file exceeding 20 MB continues for more than 8 minutes, after which I have code which merges the generated index into the existing index. Since no index is being generated now, I get an exception during the merging process. Why are files larger than 20 MB not being indexed?
I am indexing each line of the file. Why is IndexWriter not throwing any error? Do I need to change any parameter in Lucene or tweak the Lucene settings? The Lucene version is 4.4.0. My current deployment for Lucene is on a server running with 128 MB and 512 MB heap.

--
Regards
Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what lies within us"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
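[Editorial aside: Mike's remark points at the fix — make each line, or each small batch of lines, its own document instead of piling every line of a 20 MB file into one Document. The sketch below models just the batching shape in plain Java; the batch size and the Consumer sink are hypothetical stand-ins for per-document addDocument calls.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LineBatcher {

    // Accumulate lines into small batches and hand each batch off;
    // in real indexing code the sink would add one document per batch.
    static int flushInBatches(List<String> lines, int batchSize, Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>();
        int flushes = 0;
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // hand off a copy
                batch.clear();
                flushes++;
            }
        }
        if (!batch.isEmpty()) { // flush the remainder
            sink.accept(new ArrayList<>(batch));
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("l1", "l2", "l3", "l4", "l5");
        int flushes = flushInBatches(lines, 2, b -> System.out.println("batch of " + b.size()));
        System.out.println(flushes + " flushes"); // 3 flushes: 2 + 2 + 1
    }
}
```

With this shape, peak memory is bounded by the batch size regardless of file size, which is why the "bunch of tiny documents" approach does not hit OOM where the one-giant-document approach does.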