Have tried that... even going so far as to push it to a Solaris server with
plenty more RAM than my NT box... still hanging, so I assume it is something
other than memory.  So far I have stepped into it and it appears to be
hanging on HTMLParser parser = new HTMLParser(f); in HTMLDocument.class...
I think this may have something to do with JavaCC.zip & HTMLParser.jj?
Similarly, org.apache.lucene.HTMLParser.Test appears to be hanging.




                                                                                       
    
Winton Davies <wdavies@overture.com>
To:      "Lucene Users List" <[EMAIL PROTECTED]>
cc:
Date:    01/08/02 05:30 PM
Subject: Re: indexing big files
Please respond to "Lucene Users List"




My guess is Garbage Collection -- try allocating twice as much heap as
before, or more.  Also try running with -verbose:gc to see the GC activity.
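Concretely, the flags above look something like this (a sketch, assuming a Sun JVM of that era; the heap size and the demo class's arguments are illustrative, not taken from the thread):

```shell
# -Xmx raises the maximum heap; -verbose:gc prints a line for each
# garbage collection, so you can see whether the "hang" is really
# the JVM spending all its time collecting.
java -Xmx256m -verbose:gc org.apache.lucene.IndexHTML -create index docs
```

If the output shows back-to-back full collections that reclaim almost nothing, the heap is too small for the document being indexed; if GC is quiet, the hang is elsewhere (e.g. in the parser itself).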

  Cheers,
  Winton


>Question from a Lucene newbie... I'm trying to index a file structure which
>happens to include a relatively large file (310kb with 55,700 words) and
>for some reason it appears to be hanging the whole indexing process.  Here's a
>quick run-down..
>
>1) Am using a webcrawler to retrieve files and copy to my local disk.
>2) For files like .pdf's... I'm copying an .html equivalent of the file to
>my disk (but leaving .pdf extension).
>3) Then later in a separate batch process I run pretty much the standard
>out-of-the-box "org.apache.lucene.IndexHTML" demo class (except I've added
>.pdf as a possible indexing type).
>
>That's about it.  No big deal.  The transformation from pdf to html is not
>perfected yet either... so file size will definitely drop in the future...
>as nonsense terms are being included in these files.  But for now... what
>should I be looking at or altering to find out what is causing the hang?
>Thanks!
>
>Jon Wasson
>
>
>--
>To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/





