Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.

Ian Lea Thu, 29 Aug 2013 05:49:25 -0700

So you do get an exception after all: an OOM (OutOfMemoryError).

Try it without this line:


doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

I think that will slurp the whole file in one go, which will obviously
need more memory for larger files than for smaller ones.
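
To illustrate the general point in plain Java (no Lucene involved; the
class and method names are made up for this sketch, and it makes no claim
about what TextField does internally): reading a file in one go needs heap
proportional to the file size, while reading it through a fixed-size buffer
needs roughly the same small amount of memory for any file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
public class ReadStrategies {

    // Loads the entire file into memory first: heap use grows with file size.
    static long countCharsSlurped(Path file) throws IOException {
        String all = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return all.length();
    }

    // Reads through a fixed 8 KB buffer: heap use stays roughly constant.
    static long countCharsStreamed(Path file) throws IOException {
        long total = 0;
        char[] buf = new char[8192];
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println("slurped:  " + countCharsSlurped(file));
        System.out.println("streamed: " + countCharsStreamed(file));
    }
}
```

Both methods return the same count for valid UTF-8 input; only the peak
memory differs.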

Or just run the program with more memory, e.g. by raising the JVM heap
limit with -Xmx.
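
When raising the heap, it can help to confirm which limit the JVM actually
picked up. A minimal sketch, assuming nothing beyond the standard library
(HeapCheck is just an illustrative name, not part of the program above):

```java
// Hypothetical helper class, for illustration only.
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

Run it with the same options as the indexer, for example
`java -Xmx1g HeapCheck`, and compare the reported figure with the size of
the files being indexed.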


--
Ian.


On Thu, Aug 29, 2013 at 1:05 PM, Ankit Murarka wrote:
> Yes, I know that Lucene should not have any document size limits. All I get
> is a lock file inside my index folder; there is no other file in it. Then I
> get an OOM exception.
> Please provide some guidance...
>
> Here is the example:
>
> package com.issue;
>
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.LongField;
> import org.apache.lucene.document.StringField;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.IndexCommit;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.LiveIndexWriterConfig;
> import org.apache.lucene.index.LogByteSizeMergePolicy;
> import org.apache.lucene.index.MergePolicy;
> import org.apache.lucene.index.SerialMergeScheduler;
> import org.apache.lucene.index.MergePolicy.OneMerge;
> import org.apache.lucene.index.MergeScheduler;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.util.Version;
>
>
> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.FileReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.io.LineNumberReader;
> import java.util.Date;
>
> public class D {
>
>   /** Index all text files under a directory. */
>
>
>     static String[] filenames;
>
>   public static void main(String[] args) {
>
>     //String indexPath = args[0];
>
>     String indexPath="D:\\Issue";//Place where indexes will be created
>     String docsPath="Issue";    //Place where the files are kept.
>     boolean create=true;
>
>     String ch="OverAll";
>
>
>    final File docDir = new File(docsPath);
>    if (!docDir.exists() || !docDir.canRead()) {
>       System.out.println("Document directory '" +docDir.getAbsolutePath()+
> "' does not exist or is not readable, please check the path");
>       System.exit(1);
>     }
>
>     Date start = new Date();
>    try {
>      Directory dir = FSDirectory.open(new File(indexPath));
>      Analyzer analyzer=new
> com.rancore.demo.CustomAnalyzerForCaseSensitive(Version.LUCENE_44);
>      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
> analyzer);
>       iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>
>       IndexWriter writer = new IndexWriter(dir, iwc);
>       if(ch.equalsIgnoreCase("OverAll")){
>           indexDocs(writer, docDir,true);
>       }else{
>           filenames=args[2].split(",");
>          // indexDocs(writer, docDir);
>
>    }
>       writer.commit();
>       writer.close();
>
>     } catch (IOException e) {
>       System.out.println(" caught a " + e.getClass() +
>        "\n with message: " + e.getMessage());
>     }
>     catch(Exception e)
>     {
>
>         e.printStackTrace();
>     }
>  }
>
>   //Over All
>   static void indexDocs(IndexWriter writer, File file,boolean flag)
>   throws IOException {
>
>       FileInputStream fis = null;
>  if (file.canRead()) {
>
>     if (file.isDirectory()) {
>      String[] files = file.list();
>       // an IO error could occur
>       if (files != null) {
>         for (int i = 0; i < files.length; i++) {
>           indexDocs(writer, new File(file, files[i]),flag);
>         }
>       }
>    } else {
>       try {
>         fis = new FileInputStream(file);
>      } catch (FileNotFoundException fnfe) {
>
>        fnfe.printStackTrace();
>      }
>
>       try {
>
>           Document doc = new Document();
>
>           Field pathField = new StringField("path", file.getPath(),
> Field.Store.YES);
>           doc.add(pathField);
>
>           doc.add(new LongField("modified", file.lastModified(),
> Field.Store.NO));
>
>           doc.add(new StringField("name",file.getName(),Field.Store.YES));
>
>          doc.add(new TextField("contents", new BufferedReader(new
> InputStreamReader(fis, "UTF-8"))));
>
>           LineNumberReader lnr=new LineNumberReader(new FileReader(file));
>
>
>          String line=null;
>           while( null != (line = lnr.readLine()) ){
>               doc.add(new StringField("SC",line.trim(),Field.Store.YES));
>              // doc.add(new Field("contents",line,Field.Store.YES,Field.Index.ANALYZED));
>           }
>
>           if (writer.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND)
> {
>
>             writer.addDocument(doc);
>             writer.commit();
>             fis.close();
>           } else {
>               try
>               {
>             writer.updateDocument(new Term("path", file.getPath()), doc);
>
>             fis.close();
>
>               }catch(Exception e)
>               {
>                   writer.close();
>                    fis.close();
>
>                   e.printStackTrace();
>
>               }
>           }
>
>       }catch (Exception e) {
>            writer.close();
>             fis.close();
>
>          e.printStackTrace();
>       }finally {
>           // writer.close();
>
>         fis.close();
>       }
>     }
>   }
> }
> }
>
>
>
> On 8/29/2013 4:20 PM, Michael McCandless wrote:
>>
>> Lucene doesn't have document size limits.
>>
>> There are default limits for how many tokens the highlighters will process
>> ...
>>
>> But, if you are passing each line as a separate document to Lucene,
>> then Lucene only sees a bunch of tiny documents, right?
>>
>> Can you boil this down to a small test showing the problem?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Aug 29, 2013 at 1:51 AM, Ankit Murarka wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am faced with a typical issue.
>>> I have many files which I am indexing.
>>>
>>> Problem faced:
>>> a. Files smaller than 20 MB are successfully indexed and merged.
>>>
>>> b. Files larger than 20 MB are not getting indexed. No exception is
>>> thrown. Only a lock file is created in the index directory. The
>>> indexing process for a single file exceeding 20 MB continues for more
>>> than 8 minutes, after which my code merges the generated index into
>>> the existing index.
>>>
>>> Since no index is being generated now, I get an exception during the
>>> merging process.
>>>
>>> Why are files larger than 20 MB not being indexed? I am indexing each
>>> line of the file. Why is IndexWriter not throwing any error?
>>>
>>> Do I need to change any parameter in Lucene or tweak the Lucene
>>> settings? The Lucene version is 4.4.0.
>>>
>>> My current deployment for Lucene is on a server running with 128 MB and
>>> 512 MB heap.
>>>
>>> --
>>> Regards
>>>
>>> Ankit Murarka
>>>
>>> "What lies behind us and what lies before us are tiny matters compared
>>> with
>>> what lies within us"
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>
>
>
> --
> Regards
>
> Ankit Murarka
>
> "What lies behind us and what lies before us are tiny matters compared with
> what lies within us"
>
>

