Hi; I am using the HTMLParser that comes with the latest version of Lucene (in the demo).
Here is the import line:

import org.apache.lucene.demo.html.HTMLParser;

If you have lucene-demos-1.4-final.jar in your classpath, the system will find the parser class.

I am happy with the results. Let me know if you need anything else.

L

----- Original Message -----
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, November 12, 2004 3:39 AM
Subject: Re: HTMLParser.getReader returning null

> Luke Shannon wrote:
>
> Hi,
>
> May I ask which library you are using for parsing HTML pages?
> I need to index HTML pages and I want to use a good parser to
> eliminate the HTML tags.
> Can you recommend a simple parser that has a demo?
>
> Thanks,
>
> Sergiu
>
> >Hello;
> >
> >Things were working fine. I have been re-organizing my code to drop into QA
> >when I noticed I was no longer getting search results for my HTML files.
> >When I checked things out I confirmed I was still creating the Documents but
> >realized no content was being indexed.
> >
> >    HTMLParser parser = new HTMLParser(f);
> >
> >    // Add the tag-stripped contents as a Reader-valued Text field so it will
> >    // get tokenized and indexed.
> >    doc.add(Field.Text("contents", parser.getReader()));
> >    System.out.println("The content is " + doc.get("contents"));
> >
> >The SOP line above outputs a null where the contents used to be. Has anyone
> >seen this before?
> >
> >Thanks,
> >
> >Luke
> >
> >----- Original Message -----
> >From: "Will Allen" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Thursday, November 11, 2004 1:59 PM
> >Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >Any wildcard search will automatically expand your query to the number of
> >terms it finds in the index that match the wildcard.
> >
> >For example:
> >
> >wild* would become wild OR wilderness OR wildman etc., one clause for each
> >of the matching terms that exist in your index.
> >
> >Because of this, you quickly reach the 1024-clause limit. I
> >automatically set it to max int with the following line:
> >
> >BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
> >
> >-----Original Message-----
> >From: Sanyi [mailto:[EMAIL PROTECTED]
> >Sent: Thursday, November 11, 2004 6:46 AM
> >To: [EMAIL PROTECTED]
> >Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >Hi!
> >
> >First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
> >has a 1024-clause limit by default, which is good enough for me, but I still
> >think it behaves strangely.
> >
> >Example:
> >I have an index with about 20 million documents.
> >Let's say that there are about 3000 variants in the entire document set of
> >this word mask: cab*
> >Let's say that about 500 documents contain the word: spectrum
> >Now, when I search for "cab* AND spectrum", I don't expect it to throw an
> >exception.
> >It should first restrict the search to the 500 documents containing the
> >word "spectrum", then it should collect the variants of "cab*" within these
> >documents, which comes to two or three variants of "cab*" (cable, cables,
> >maybe some more), and the search should return, let's say, 10 documents.
> >
> >Similar example: when I search for "cab* AND nonexistingword" it still
> >throws a TooManyClauses exception instead of saying "No results". Since
> >there is no "nonexistingword" in my document set, it doesn't even have to
> >start collecting the variations of "cab*".
> >
> >Is there any patch for this issue?
> >Thank you for your time!
> >
> >Sanyi
> >(I'm using: lucene 1.4.2)
> >
> >p.s.: Sorry for re-sending this message, I first sent it as an
> >accidental reply to the wrong thread.
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
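
The wildcard expansion Will Allen describes in the quoted message can be sketched roughly as follows. This is only a plain-Java illustration, not Lucene's actual code: the class name WildcardExpansionSketch, the method expandPrefix, and the sorted set standing in for the index's term dictionary are all made up for the example, and only the 1024 default comes from the thread. It shows why "cab* AND spectrum" can hit the limit: the expansion looks only at the terms in the index, without regard to the other clause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java illustration only; this does not use or reproduce Lucene's classes.
class WildcardExpansionSketch {

    // Hypothetical stand-in for the clause limit (1024 by default, per the thread).
    static int maxClauseCount = 1024;

    // Collects every indexed term that starts with the prefix, one clause per term.
    static List<String> expandPrefix(SortedSet<String> indexTerms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // tailSet(prefix) begins at the first term >= prefix in sorted order.
        for (String term : indexTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            if (clauses.size() >= maxClauseCount) {
                // Assumption: mimics the TooManyClauses failure with a standard exception.
                throw new IllegalStateException("too many clauses, limit is " + maxClauseCount);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("cable");
        terms.add("spectrum");
        terms.add("wild");
        terms.add("wilderness");
        terms.add("wildman");
        // wild* becomes: wild OR wilderness OR wildman
        System.out.println(expandPrefix(terms, "wild"));
    }
}
```

With the five sample terms, main prints [wild, wilderness, wildman]. Lowering maxClauseCount to 2 makes the same call throw, which is the behaviour that raising the limit with BooleanQuery.setMaxClauseCount works around.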