If I were you, I would write it like this. Not sure this helps. Let me know how
it works.
public static void createIndex() throws CorruptIndexException,
        LockObtainFailedException, IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    Directory indexDir = FSDirectory.open(new File("/media/work/WIKI/indexes/"));
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
            recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document document = new Document();
        // String path = file.getCanonicalPath();
        // FIELD_NAME and fieldValue are placeholders for your own field name
        // and value.
        document.add(new Field(FIELD_NAME, fieldValue, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        indexWriter.addDocument(document);
    }
    indexWriter.close();
}
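The loop above adds every document and only flushes when the writer is closed; for millions of files it can help to commit in batches so buffered state stays bounded. A minimal, stdlib-only sketch of that batching logic, where the hypothetical `DocSink` interface stands in for Lucene's IndexWriter so the loop can be shown (and run) without the Lucene jars on the classpath:

```java
import java.util.List;

// Hypothetical stand-in for Lucene's IndexWriter, so the batching
// logic can run without Lucene on the classpath.
interface DocSink {
    void addDocument(String doc);
    void commit();   // flush buffered documents to stable storage
}

public class BatchedIndexer {
    // Commit every `batchSize` documents so the writer's buffered state
    // stays bounded, instead of accumulating until close().
    // Returns the number of commits issued.
    public static int indexAll(List<String> docs, DocSink sink, int batchSize) {
        int commits = 0;
        int sinceCommit = 0;
        for (String doc : docs) {
            sink.addDocument(doc);
            if (++sinceCommit >= batchSize) {
                sink.commit();
                commits++;
                sinceCommit = 0;
            }
        }
        // Flush any final partial batch.
        if (sinceCommit > 0) {
            sink.commit();
            commits++;
        }
        return commits;
    }
}
```

With a real Lucene 3.0 writer you would call indexWriter.commit() at the same point in the loop; the counter-based structure is the only thing this sketch claims.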
Good Luck.
Qi Li
On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <
[email protected]> wrote:
> With the different parameters I still got the same error. My code is very
> simple; indeed, I am only concerned with creating the index, and then I will
> do some private information retrieval experiments on the inverted index
> file, which I created from the information extracted from the index. That is
> why I didn't go into optimization until now. The database size I had before
> was very small compared to 4.5 million.
> My code is as follows:
>
> public static void createIndex() throws CorruptIndexException,
>         LockObtainFailedException, IOException {
>     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>     Directory indexDir = FSDirectory.open(new File("/media/work/WIKI/indexes/"));
>     boolean recreateIndexIfExists = true;
>     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>     indexWriter.setUseCompoundFile(false);
>     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>     File[] files = dir.listFiles();
>     for (File file : files) {
>         Document document = new Document();
>
>         // String path = file.getCanonicalPath();
>         // document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>         //         Field.Index.NOT_ANALYZED));
>
>         Reader reader = new FileReader(file);
>         document.add(new Field(FIELD_CONTENTS, reader));
>
>         indexWriter.addDocument(document);
>     }
>     indexWriter.optimize();
>     indexWriter.close();
> }
>
>
> On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <[email protected]> wrote:
>
> > 1. What difference did you see when you used different VM parameters?
> > 2. What merge policy and optimization strategy did you use?
> > 3. How did you use commit or flush?
> >
> > Qi
> >
> > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
> > [email protected]> wrote:
> >
> > > Thank you so much for this info. It looks pretty complicated for me, but
> > > I will try.
> > >
> > >
> > >
> > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <[email protected]
> > > >wrote:
> > >
> > > > You can start a fixed thread pool to index all these files in a
> > > > multithreaded way. Each thread executes an indexing task that indexes
> > > > a part of all the files. In each task, after indexing 10,000 files,
> > > > you need to execute the indexWriter.commit() method to flush all the
> > > > pending index additions to the disk file.
> > > >
> > > > If you need to index all these files into only one index, you need to
> > > > hold only one IndexWriter instance among all the indexing threads.
> > > >
> > > > Hope it's helpful.
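The fixed-thread-pool pattern described above can be sketched with the JDK alone. In this sketch a thread-safe counter is a hypothetical stand-in for the single shared IndexWriter (Lucene's IndexWriter is itself thread-safe, so one instance can be shared by all worker threads the same way):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    // Stand-in for the one shared writer: every worker thread updates
    // the same instance, mirroring "one IndexWriter among all threads".
    static final AtomicInteger indexed = new AtomicInteger();

    static void indexFile(String path) {
        // Real code would read the file and call the shared
        // indexWriter.addDocument(...) here.
        indexed.incrementAndGet();
    }

    // Submit one task per file to a fixed-size pool, then wait for
    // all tasks to finish. Returns the number of files indexed.
    public static int indexInParallel(List<String> files, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String f : files) {
            pool.execute(new Runnable() {
                public void run() { indexFile(f); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }
}
```

In a real indexer each task would typically take a slice of the file list rather than a single file, and would call commit() on the shared writer every 10,000 documents, as suggested above.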
> > > >
> > > >
> > > >
> > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
> > > > [email protected]> wrote:
> > > >
> > > > > Thank you Johnbin,
> > > > > do you know which parameter I have to play with?
> > > > >
> > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <
> > [email protected]
> > > > > >wrote:
> > > > >
> > > > > > I think you can write the index to file once every 10,000 files or
> > > > > > fewer have been read.
> > > > > >
> > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I have to index about 4.5 million txt files. When I run my
> > > > > > > indexing application through Eclipse, I get this error: "Exception
> > > > > > > in thread "main" java.lang.OutOfMemoryError: Java heap space"
> > > > > > >
> > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
> > > > > > >
> > > > > > > eclipse -vmargs -Xms40M -Xmx2G
> > > > > > >
> > > > > > > I tried running Eclipse with the above memory parameters, but
> > > > > > > still had the same error. The architecture of my computer is an
> > > > > > > AMD x2 64-bit 2GHz processor, Ubuntu 10.04 LTS 64-bit,
> > > > > > > java-6-openjdk.
> > > > > > >
> > > > > > > Does anybody have a suggestion?
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Sahin.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > cheers,
> > > > > > Johnbin Wang
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > cheers,
> > > > Johnbin Wang
> > > >
> > >
> >
>