Maybe mostly 1K, but you only need one very large doc to cause a problem. I haven't been following this thread, so apologies if I've missed things, but you seem to be having problems running what should be a simple job of the sort that Lucene handles every day without breaking a sweat.
Does it always fail on the same doc, or after indexing the same number of docs? What exactly is the exception and stack trace? Are you sure the problem is with your Lucene calls, not something else you are doing in the same program? Cutting things down to the smallest, simplest possible standalone program or test case that demonstrates the problem often helps. If you do that and can't find the problem yourself, post the code here along with the stack trace and details such as how many docs it processes before failure and the size of the doc it fails on.

--
Ian.

On Thu, Oct 21, 2010 at 4:01 AM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
> By the way, the file size is not big, mostly 1 kB. I am working on Wikipedia articles in txt format.
>
> On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>
>> Unfortunately, neither method went through. I am getting the memory error even when reading the directory contents.
>>
>> Now I am thinking this: what if I split the 4.5 million files into directories of 100,000 files (or fewer, depending on the Java error), index each of them separately, and merge those indexes (if possible)?
>>
>> Any suggestions?
>>
>> On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> My first guess is that you're accumulating too many documents before the flush gets triggered. The quick-n-dirty way to test this is to do an IndexWriter flush after every addDocument. This will slow down indexing, but it will also tell you whether this is the problem, and you can then look for more elegant solutions.
>>>
>>> You can also get some stats via IndexWriter.getRAMBufferSizeMB, and can force automatic flushes at a given RAM buffer size via IndexWriter.setRAMBufferSizeMB.
>>>
>>> One thing I'd be interested in is how big your files are. Might it be that you're trying to process a humongous file when it blows up?
>>>
>>> And if none of that helps, please post your stack trace.
>>>
>>> Best,
>>> Erick
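[Editorial note: the RAM-buffer knob Erick mentions and the sub-index merge Sahin asks about above look roughly like the following against the Lucene 3.0.x API used later in this thread. This is only a hedged sketch: the class name, buffer size, and paths are illustrative, and in the 3.0.x line the Directory-based merge call is addIndexesNoOptimize (later releases renamed it addIndexes).]

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeSubIndexes {

        // Merge separately built sub-indexes into one target index, roughly as
        // proposed above. All names and paths here are illustrative.
        public static void merge(File targetDir, File... partDirs) throws IOException {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(targetDir),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true,                                  // create a fresh target index
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setRAMBufferSizeMB(32.0);               // Erick's knob; the 3.0.x default is 16 MB

            Directory[] parts = new Directory[partDirs.length];
            for (int i = 0; i < partDirs.length; i++) {
                parts[i] = FSDirectory.open(partDirs[i]);
            }
            writer.addIndexesNoOptimize(parts);            // 3.0.x name for the Directory-based merge
            writer.commit();
            writer.close();
        }
    }

[Each sub-index only needs to be a complete, closed index before it is merged in; nothing is reindexed.]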
>>> > With the different parameters I still got the same error. My code is very simple; indeed, I am only concerned with creating the index, and then I will do some private information retrieval experiments on the inverted index file, which I created with the information extracted from the index. That is why I didn't go over optimization until now. The database size I had was very small compared to 4.5 million.
>>> >
>>> > My code is as follows:
>>> >
>>> > public static void createIndex() throws CorruptIndexException,
>>> >         LockObtainFailedException, IOException {
>>> >     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>>> >     Directory indexDir = FSDirectory.open(new File("/media/work/WIKI/indexes/"));
>>> >     boolean recreateIndexIfExists = true;
>>> >     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>>> >             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>>> >     indexWriter.setUseCompoundFile(false);
>>> >     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>>> >     File[] files = dir.listFiles();
>>> >     for (File file : files) {
>>> >         Document document = new Document();
>>> >
>>> >         //String path = file.getCanonicalPath();
>>> >         //document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.NOT_ANALYZED));
>>> >
>>> >         Reader reader = new FileReader(file);
>>> >         document.add(new Field(FIELD_CONTENTS, reader));
>>> >
>>> >         indexWriter.addDocument(document);
>>> >     }
>>> >     indexWriter.optimize();
>>> >     indexWriter.close();
>>> > }
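[Editorial note: a minimal reworking of that loop along the lines suggested elsewhere in the thread, shown here only as a sketch against the same Lucene 3.0.x API: it commits every few thousand documents instead of buffering everything until close, closes each FileReader explicitly, walks one subdirectory at a time rather than listing all 4.5 million files in a single array, and defers optimize() until the build itself succeeds. The class name, FIELD_CONTENTS value, batch size, and directory layout are assumptions for illustration.]

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BatchedIndexer {

        private static final String FIELD_CONTENTS = "contents";   // illustrative field name

        public static void createIndex(File indexPath, File topDir) throws IOException {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            Directory indexDir = FSDirectory.open(indexPath);
            IndexWriter indexWriter = new IndexWriter(indexDir, analyzer, true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            indexWriter.setRAMBufferSizeMB(32.0);       // bound buffered RAM; 3.0.x default is 16 MB

            int count = 0;
            // Assumes the corpus is already split into subdirectories, as proposed
            // earlier in the thread, so no single listFiles() call has to return
            // millions of entries at once.
            for (File subDir : topDir.listFiles()) {
                if (!subDir.isDirectory()) {
                    continue;
                }
                for (File file : subDir.listFiles()) {
                    Document document = new Document();
                    Reader reader = new FileReader(file);
                    try {
                        document.add(new Field(FIELD_CONTENTS, reader));
                        indexWriter.addDocument(document);   // analysis consumes the reader here
                    } finally {
                        reader.close();                      // don't leak file handles
                    }
                    if (++count % 10000 == 0) {
                        indexWriter.commit();                // flush pending docs to disk
                    }
                }
            }
            indexWriter.commit();
            indexWriter.close();    // run optimize() separately once indexing completes reliably
        }
    }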
>>> >
>>> > On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <aler...@gmail.com> wrote:
>>> >
>>> > > 1. What is the difference when you used the different VM parameters?
>>> > > 2. What merge policy and optimization strategy did you use?
>>> > > 3. How did you use commit or flush?
>>> > >
>>> > > Qi
>>> > >
>>> > > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>> > >
>>> > > > Thank you so much for this info. It looks pretty complicated for me, but I will try.
>>> > > >
>>> > > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
>>> > > >
>>> > > > > You can start a fixedThreadPool to index all these files in a multithreaded way. Every thread executes an indexing task which indexes a part of all the files. In each task, after every 10,000 files have been indexed, execute the indexWriter.commit() method to flush the pending add operations to the index files on disk.
>>> > > > >
>>> > > > > If you need to index all these files into only one index, you need to hold a single indexWriter instance shared among all the indexing threads.
>>> > > > >
>>> > > > > Hope it's helpful.
>>> > > > >
>>> > > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>> > > > >
>>> > > > > > Thank you Johnbin. Do you know which parameter I have to play with?
>>> > > > > >
>>> > > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <johnbin.w...@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > I think you can write the index out once every 10,000 files or fewer have been read.
>>> > > > > > >
>>> > > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com> wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > I have to index about 4.5 million txt files. When I run my indexing application through Eclipse, I get this error: "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space"
>>> > > > > > > >
>>> > > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
>>> > > > > > > >
>>> > > > > > > > eclipse -vmargs -Xms40M -Xmx2G
>>> > > > > > > >
>>> > > > > > > > I tried running Eclipse with the above memory parameters, but still had the same error. The architecture of my computer is an AMD X2 64-bit 2 GHz processor, Ubuntu 10.04 LTS 64-bit, java-6-openjdk.
>>> > > > > > > >
>>> > > > > > > > Anybody have a suggestion?
>>> > > > > > > >
>>> > > > > > > > Thank you.
>>> > > > > > > > Sahin.
>>> > > > > > >
>>> > > > > > > --
>>> > > > > > > cheers,
>>> > > > > > > Johnbin Wang
>>> > > > >
>>> > > > > --
>>> > > > > cheers,
>>> > > > > Johnbin Wang
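[Editorial note: to make Johnbin Wang's fixedThreadPool suggestion above concrete, here is a rough, assumption-heavy sketch: a single IndexWriter shared by all worker tasks, with a commit every 10,000 documents. The class name, field name, thread count, and batching scheme are invented for illustration; only the general approach (one shared writer, periodic commits) comes from the thread.]

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThreadedIndexer {

        private static final AtomicInteger COUNT = new AtomicInteger();

        // Each entry in "batches" is one chunk of the corpus (for example, the
        // contents of one subdirectory). All tasks share the one IndexWriter,
        // whose addDocument and commit methods are safe to call concurrently.
        public static void indexInParallel(final IndexWriter writer,
                                           List<File[]> batches) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);   // thread count is illustrative
            for (final File[] batch : batches) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            for (File file : batch) {
                                Document doc = new Document();
                                Reader reader = new FileReader(file);
                                try {
                                    doc.add(new Field("contents", reader));  // illustrative field name
                                    writer.addDocument(doc);
                                } finally {
                                    reader.close();
                                }
                                if (COUNT.incrementAndGet() % 10000 == 0) {
                                    writer.commit();   // flush pending adds, as Johnbin suggests
                                }
                            }
                        } catch (IOException e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }
    }

[The caller still does a final commit and closes the writer after the pool has shut down.]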