JGuru explanation: http://www.jguru.com/faq/view.jsp?EID=1074228
I have no sample code for neko, I think nutch uses it though. For tidy, you can look at ant in the sandbox: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup HTH, sv On Wed, 25 Aug 2004, Hetan Shah wrote: > Do you have any pointers for sample code for them? > Would highly appreciate it. > Thanks. > -H > > Stephane James Vaucher wrote: > > > I don't think that the demo parser is meant as a production > > system component. You can look at Tidy or NekoHtml. They cleanup your html > > and are probably optimised. > > > > sv > > > > On Wed, 25 Aug 2004, Hetan Shah wrote: > > > > > >>Hello all, > >> > >>Is there a way to reduce the indexing time taken when the indexer is > >>indexing about 30,000 + files. It is roughly taking around 6-7 hours to > >>do this. I am using IndexHTML class to create the index out of HTML files. > >> > >>Another issue that I see is every once in a while I get the following > >>output on the screen. > >> > >>adding ../31/1104852.html > >>Parse Aborted: Encountered "\"" at line 7, column 1. > >>Was expecting one of: > >> <ArgName> ... > >> "=" ... > >> <TagEnd> ... > >> > >>Any suggestions on preventing this from happening? > >> > >>Thanks in advance. > >>-H > >> > >> > >>--------------------------------------------------------------------- > >>To unsubscribe, e-mail: [EMAIL PROTECTED] > >>For additional commands, e-mail: [EMAIL PROTECTED] > >> > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]