Hetan, If you are using a corpus with multiple editors, I suggest that you use a cleaner like tidy as there might be weird stuff appearing in the html.
sv On Thu, 26 Aug 2004, Karthik N S wrote: > Hi Hetan > > > Th's the major Problem of non Standatrdized Tags for HTML Document's > u are Indexing ,resulting in lag time taken for Indexing process.... > > > If u can Tweak the HTMLParser.jj file within lucene.zip '/demo/html' > file > [U have to have some Knowledge of JAVACC for this]. > > > > Karthik > > -----Original Message----- > From: Hetan Shah [mailto:[EMAIL PROTECTED] > Sent: Thursday, August 26, 2004 3:01 AM > To: Lucene Users List > Subject: Time to index documents > > > Hello all, > > Is there a way to reduce the indexing time taken when the indexer is > indexing about 30,000 + files. It is roughly taking around 6-7 hours to > do this. I am using IndexHTML class to create the index out of HTML files. > > Another issue that I see is every once in a while I get the following > output on the screen. > > adding ../31/1104852.html > Parse Aborted: Encountered "\"" at line 7, column 1. > Was expecting one of: > <ArgName> ... > "=" ... > <TagEnd> ... > > Any suggestions on preventing this from happening? > > Thanks in advance. > -H > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
