This has nothing to do with Lucene, but since I have written something very similar, I'll take the bait. You're best off using XPath or a similar XML/HTML query language to extract the product specs, prices, or whatever else you're after. Each webshop you index will need its own set of query expressions for extracting the data you need. So: extract the data with a query language first, then write a clean Lucene index from the parsed data.
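
As a rough sketch of the per-shop idea, something like the following could work with the javax.xml.xpath package from the JDK. The class name PriceExtractor, the shop keys "shop-a" and "shop-b", and both XPath expressions are made up for illustration; you would write the real expressions by looking at each shop's markup. It also assumes the page is well-formed XML, so real-world HTML would probably have to be tidied first.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class PriceExtractor {

    // Hypothetical per-shop configuration: one XPath expression per webshop,
    // because every shop marks up its prices differently.
    private static final Map<String, String> PRICE_XPATH = new HashMap<>();
    static {
        PRICE_XPATH.put("shop-a", "//b[@class='Price']");
        PRICE_XPATH.put("shop-b", "//span[@id='cost']");
    }

    // Returns the trimmed text of the first node matched by the shop's expression.
    // Assumes the input is well-formed XML/XHTML.
    static String extractPrice(String shop, String xhtml) {
        String expression = PRICE_XPATH.get(shop);
        if (expression == null) {
            throw new IllegalArgumentException("No XPath configured for shop: " + shop);
        }
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(xhtml));
            return xpath.evaluate(expression, source).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String pageA = "<html><body><b class=\"Price\"> $40 </b></body></html>";
        String pageB = "<html><body><span id=\"cost\">EUR 35</span></body></html>";

        System.out.println(extractPrice("shop-a", pageA)); // prints $40
        System.out.println(extractPrice("shop-b", pageB)); // prints EUR 35
    }
}
```

The string returned by extractPrice is what you would then put into a field of the Lucene document for that product, next to the product name and the shop.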

http://www.nabble.com/w3.org---www-xpath-comments-f11758.html
http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/package-summary.html

On Tue, Jan 13, 2009 at 6:29 PM, ppuyen <[email protected]> wrote:
>
> hi everyone,
> I run example Indexing files HTML from "Lucene in Action " .
> there can getTitle and getBody of file HTML .
>
> protected String getTitle(Element rawDoc) {
>     if (rawDoc == null) {
>         return null;
>     }
>     //System.out.println("getTitle");
>     String title = "";
>     NodeList children = rawDoc.getElementsByTagName("title");
>     if (children.getLength() > 0) {
>         Element titleElement = ((Element) children.item(0));
>         Text text = (Text) titleElement.getFirstChild();
>         if (text != null) {
>             title = text.getData();
>         }
>     }
>     System.out.println("getTitle:"+ title);
>     return title;
> }
>
>
> My project is commercial search engine. it's mean. when i find one product
> (example Nokia N72 ) . after click button "Submit" , the result need show
> name of product and Price each shop.
> I run file Indexing file HTML , there're can getTitle and getBody.
> My problem now is get Class ( example : < b class="Price"> $40 < /b> ) .
> but each web's Class name is different .
> Help me how could i do ?
> thanks so much.
>
>
> --
> View this message in context:
> http://www.nabble.com/Get-element-Class-DOM-%21%21%21%21-tp21440434p21440434.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
